2025-06-06

Title: A Comprehensive Survey on the Risks and Limitations of Concept-based Models

Authors: Sanchit Sinha, Aidong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04237
Pdf URL: https://arxiv.org/pdf/2506.04237
Copy Paste: [[2506.04237]] A Comprehensive Survey on the Risks and Limitations of Concept-based Models(https://arxiv.org/abs/2506.04237)
Keywords: robust
Abstract: Concept-based Models are a class of inherently explainable networks that improve upon standard Deep Neural Networks by providing a rationale behind their predictions using human-understandable `concepts'. With these models being highly successful in critical applications like medical diagnosis and financial risk prediction, there is a natural push toward their wider adoption in sensitive domains to instill greater trust among diverse stakeholders. However, recent research has uncovered significant limitations in the structure of such networks, their training procedure, underlying assumptions, and their susceptibility to adversarial vulnerabilities. In particular, issues such as concept leakage, entangled representations, and limited robustness to perturbations pose challenges to their reliability and generalization. Additionally, the effectiveness of human interventions in these models remains an open question, raising concerns about their real-world applicability. In this paper, we provide a comprehensive survey on the risks and limitations associated with Concept-based Models. In particular, we focus on aggregating commonly encountered challenges and the architecture choices mitigating these challenges for Supervised and Unsupervised paradigms. We also examine recent advances in improving their reliability and discuss open problems and promising avenues of future research in this domain.

Title: Improving Out-of-Distribution Detection with Markov Logic Networks

Authors: Konstantin Kirchheim, Frank Ortmeier
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04241
Pdf URL: https://arxiv.org/pdf/2506.04241
Copy Paste: [[2506.04241]] Improving Out-of-Distribution Detection with Markov Logic Networks(https://arxiv.org/abs/2506.04241)
Keywords: explainability
Abstract: Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models operating in open-world scenarios. Current OOD detectors mainly rely on statistical models to identify unusual patterns in the latent representations of a deep neural network. This work proposes to augment existing OOD detectors with probabilistic reasoning, utilizing Markov logic networks (MLNs). MLNs connect first-order logic with probabilistic reasoning to assign probabilities to inputs based on weighted logical constraints defined over human-understandable concepts, which offers improved explainability. Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. Furthermore, we introduce a simple algorithm for learning logical constraints for OOD detection from a dataset and showcase its effectiveness.

Title: Triple Attention Transformer Architecture for Time-Dependent Concrete Creep Prediction

Authors: Warayut Dokduea, Weerachart Tangchirapat, Sompote Youwai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04243
Pdf URL: https://arxiv.org/pdf/2506.04243
Copy Paste: [[2506.04243]] Triple Attention Transformer Architecture for Time-Dependent Concrete Creep Prediction(https://arxiv.org/abs/2506.04243)
Keywords: interpretability, transformer
Abstract: This paper presents a novel Triple Attention Transformer Architecture for predicting time-dependent concrete creep, addressing fundamental limitations in current approaches that treat time as merely an input parameter rather than modeling the sequential nature of deformation development. By transforming concrete creep prediction into an autoregressive sequence modeling task similar to language processing, our architecture leverages the transformer's self-attention mechanisms to capture long-range dependencies in historical creep patterns. The model implements a triple-stream attention framework incorporating temporal attention for sequential progression, feature attention for material property interactions, and batch attention for inter-sample relationships. Evaluated on experimental datasets with standardized daily measurements spanning 160 days, the architecture achieves exceptional performance with mean absolute percentage error of 1.63% and R2 values of 0.999 across all datasets, substantially outperforming traditional empirical models and existing machine learning approaches. Ablation studies confirm the critical role of attention mechanisms, with attention pooling contributing most significantly to model performance. SHAP analysis reveals Young's modulus as the primary predictive feature, followed by density and compressive strength, providing interpretability essential for engineering applications. A deployed web-based interface facilitates practical implementation, enabling real-time predictions using standard laboratory parameters. This work establishes the viability of applying transformer architectures to materials science problems, demonstrating the potential for data-driven approaches to revolutionize structural behavior prediction and engineering design practices.

Title: SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Authors: Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04250
Pdf URL: https://arxiv.org/pdf/2506.04250
Copy Paste: [[2506.04250]] SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs(https://arxiv.org/abs/2506.04250)
Keywords: interpretability, large language model
Abstract: Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality, topic relevance, and without explicit refusal, and (iii) accomplishing this without a hard requirement of contrastive pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.

Title: Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

Authors: Alan Mitkiy, James Smith, Hana Satou, Hiroshi Tanaka, Emily Johnson, F Monkey
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04263
Pdf URL: https://arxiv.org/pdf/2506.04263
Copy Paste: [[2506.04263]] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training(https://arxiv.org/abs/2506.04263)
Keywords: robust
Abstract: Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.

Title: RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Authors: Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, Wenbo Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04277
Pdf URL: https://arxiv.org/pdf/2506.04277
Copy Paste: [[2506.04277]] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought(https://arxiv.org/abs/2506.04277)
Keywords: large language model, segmentation
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability while lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structuralized framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrates textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpasses state-of-the-art methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and achieves 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.

Title: Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Authors: Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tianshuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xiangchao Meng, Yuxin Zhang, Yanlin Li, Lei Ren, Wei Chen, Zhiyuan Huang, Mingjie Zhan, Xiaojie Wang, Fangxiang Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04280
Pdf URL: https://arxiv.org/pdf/2506.04280
Copy Paste: [[2506.04280]] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark(https://arxiv.org/abs/2506.04280)
Keywords: large language model
Abstract: With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the $\textbf{Multimodal Multi-image Reasoning Benchmark (MMRB)}$, the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises $\textbf{92 sub-tasks}$ covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on $\textbf{40 MLLMs}$, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

Title: SF$^2$Bench: Evaluating Data-Driven Models for Compound Flood Forecasting in South Florida

Authors: Xu Zheng, Chaohao Lin, Sipeng Chen, Zhuomin Chen, Jimeng Shi, Wei Cheng, Jayantha Obeysekera, Jason Liu, Dongsheng Luo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04281
Pdf URL: https://arxiv.org/pdf/2506.04281
Copy Paste: [[2506.04281]] SF$^2$Bench: Evaluating Data-Driven Models for Compound Flood Forecasting in South Florida(https://arxiv.org/abs/2506.04281)
Keywords: transformer, large language model
Abstract: Forecasting compound floods presents a significant challenge due to the intricate interplay of meteorological, hydrological, and oceanographic factors. Analyzing compound floods has become more critical as the global climate increases flood risks. Traditional physics-based methods, such as the Hydrologic Engineering Center's River Analysis System, are often time-inefficient. Machine learning has recently demonstrated promise in both modeling accuracy and computational efficiency. However, the scarcity of comprehensive datasets currently hinders systematic analysis. Existing water-related datasets are often limited by a sparse network of monitoring stations and incomplete coverage of relevant factors. To address this challenge, we introduce SF2Bench, a comprehensive time series collection on compound floods in South Florida, which integrates four key factors: tide, rainfall, groundwater, and human management activities (gate and pump controlling). This integration allows for a more detailed analysis of the individual contributions of these drivers to compound flooding and informs the development of improved flood forecasting approaches. To comprehensively evaluate the potential of various modeling paradigms, we assess the performance of six categories of methods, encompassing Multilayer Perceptrons, Convolutional Neural Networks, Recurrent Neural Networks, Graph Neural Networks, Transformers, and Large Language Models. We verified the impact of different key features on flood forecasting through experiments. Our analysis examines temporal and spatial aspects, providing insights into the influence of historical data and spatial dependencies. The varying performance across these approaches underscores the diverse capabilities of each in capturing complex temporal and spatial dependencies inherent in compound floods.

Title: DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience

Authors: Runxiang Wang, Boxiao Wang, Kai Li, Yifan Zhang, Jian Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04282
Pdf URL: https://arxiv.org/pdf/2506.04282
Copy Paste: [[2506.04282]] DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience(https://arxiv.org/abs/2506.04282)
Keywords: robust, large language model
Abstract: Symbolic regression is a fundamental tool for discovering interpretable mathematical expressions from data, with broad applications across scientific and engineering domains. Recently, large language models (LLMs) have demonstrated strong performance in this task, leveraging embedded scientific priors and reasoning capabilities to surpass traditional methods. However, existing LLM-based approaches, such as LLM-SR, often over-rely on internal priors, lacking explicit data understanding and systematic reflection during equation generation. To address these limitations, we propose DrSR (Dual Reasoning Symbolic Regression), a framework that combines data-driven insight with reflective learning to enhance both robustness and discovery capability. Specifically, DrSR guides LLMs to analyze structural relationships (e.g., monotonicity, nonlinearity, and correlation) within the data to generate structured descriptions. Simultaneously, it monitors equation performance and establishes a feedback loop to refine subsequent generations. By integrating data understanding and generation reflection in a closed loop, DrSR enables more efficient exploration of the symbolic expression space. Experiments across interdisciplinary datasets in physics, chemistry, biology, and materials science demonstrate that DrSR substantially improves the valid equation rate and consistently outperforms both classical and recent LLM-based methods in terms of accuracy, generalization, and search efficiency. These results underscore its potential for scientific equation discovery.

Title: Backbone Augmented Training for Adaptations

Authors: Jae Wan Park, Junhyeok Kim, Youngjun Jun, Hyunah Ko, Seong Jae Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04288
Pdf URL: https://arxiv.org/pdf/2506.04288
Copy Paste: [[2506.04288]] Backbone Augmented Training for Adaptations(https://arxiv.org/abs/2506.04288)
Keywords: diffusion, transformer
Abstract: Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the backbone models. We propose Backbone Augmented Training (BAT), a method that leverages backbone data to augment the adaptation dataset. First, we formulate and prove two mathematical key propositions: one establishes the validity of BAT, while the other identifies a condition under which BAT benefits adaptation. Furthermore, we introduce an advanced data selection scheme that satisfies these propositions and present ALBAT algorithm to implement this approach. ALBAT efficiently enhances adaptation training in both personalization and language generation tasks with scarce data.

Title: Relational reasoning and inductive bias in transformers trained on a transitive inference task

Authors: Jesse Geerts, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.04289
Pdf URL: https://arxiv.org/pdf/2506.04289
Copy Paste: [[2506.04289]] Relational reasoning and inductive bias in transformers trained on a transitive inference task(https://arxiv.org/abs/2506.04289)
Keywords: transformer
Abstract: Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning in different learning regimes remain poorly understood. In this work, we investigate how transformers perform a classic relational reasoning task from the Psychology literature, \textit{transitive inference}, which requires inference about indirectly related items by integrating information across observed adjacent item pairs (e.g., if A>B and B>C, then A>C). We compare transitive inference behavior across two distinct learning regimes: in-weights learning (IWL), where models store information in network parameters, and in-context learning (ICL), where models flexibly utilize information presented within the input sequence. Our findings reveal that IWL naturally induces a generalization bias towards transitive inference, despite being trained only on adjacent items, whereas ICL models trained solely on adjacent items do not generalize transitively. Mechanistic analysis shows that ICL models develop induction circuits that implement a simple match-and-copy strategy that performs well at relating adjacent pairs, but does not encoding hierarchical relationships among indirectly related items. Interestingly, when pre-trained on in-context linear regression tasks, transformers successfully exhibit in-context generalizable transitive inference. Moreover, like IWL, they display both \textit{symbolic distance} and \textit{terminal item effects} characteristic of human and animal performance, without forming induction circuits. These results suggest that pre-training on tasks with underlying structure promotes the development of representations that can scaffold in-context relational reasoning.

Title: AUTOCT: Automating Interpretable Clinical Trial Prediction with LLM Agents

Authors: Fengze Liu, Haoyu Wang, Joonhyuk Cho, Dan Roth, Andrew W. Lo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04293
Pdf URL: https://arxiv.org/pdf/2506.04293
Copy Paste: [[2506.04293]] AUTOCT: Automating Interpretable Clinical Trial Prediction with LLM Agents(https://arxiv.org/abs/2506.04293)
Keywords: interpretability, explainability, large language model
Abstract: Clinical trials are critical for advancing medical treatments but remain prohibitively expensive and time-consuming. Accurate prediction of clinical trial outcomes can significantly reduce research and development costs and accelerate drug discovery. While recent deep learning models have shown promise by leveraging unstructured data, their black-box nature, lack of interpretability, and vulnerability to label leakage limit their practical use in high-stakes biomedical contexts. In this work, we propose AutoCT, a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning. AutoCT autonomously generates, evaluates, and refines tabular features based on public information without human input. Our method uses Monte Carlo Tree Search to iteratively optimize predictive performance. Experimental results show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self-refinement iterations, establishing a new paradigm for scalable, interpretable, and cost-efficient clinical trial prediction.

Title: Deep learning for predicting hauling fleet production capacity under uncertainties in open pit mines using real and simulated data

Authors: N Guerin (CGS i3), M Nakhla (CGS i3), A Dehoux (ERAMET), J L Loyer (ERAMET)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04296
Pdf URL: https://arxiv.org/pdf/2506.04296
Copy Paste: [[2506.04296]] Deep learning for predicting hauling fleet production capacity under uncertainties in open pit mines using real and simulated data(https://arxiv.org/abs/2506.04296)
Keywords: robust
Abstract: Accurate short-term forecasting of hauling-fleet capacity is crucial in open-pit mining, where weather fluctuations, mechanical breakdowns, and variable crew availability introduce significant operational uncertainties. We propose a deep-learning framework that blends real-world operational records (high-resolution rainfall measurements, fleet performance telemetry) with synthetically generated mechanical-breakdown scenarios to enable the model to capture fluctuating high-impact failure events. We evaluate two architectures: an XGBoost regressor achieving a median absolute error (MedAE) of 14.3 per cent and a Long Short-Term Memory network with a MedAE of 15.1 per cent. Shapley Additive exPlanations (SHAP) value analyses identify cumulative rainfall, historical payload trends, and simulated breakdown frequencies as dominant predictors. Integration of simulated breakdown data and shift-planning features notably reduces prediction volatility. Future work will further integrate maintenance-scheduling indicators (Mean Time Between Failures, Mean Time to Repair), detailed human resource data (operator absenteeism, crew efficiency metrics), blast event scheduling, and other operational constraints to enhance forecast robustness and adaptability. This hybrid modelling approach offers a comprehensive decision-support tool for proactive, data-driven fleet management under dynamically uncertain conditions.

Title: RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming

Authors: Xiang Zheng, Xingjun Ma, Wei-Bin Lee, Cong Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04302
Pdf URL: https://arxiv.org/pdf/2506.04302
Copy Paste: [[2506.04302]] RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming(https://arxiv.org/abs/2506.04302)
Keywords: large language model
Abstract: Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, a lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the General Advantage Estimator. It supports a variety of token and sentence diversity metrics, featuring modularized intrinsic reward computation that facilitates plug-and-play experimentation. To clarify their influence on RFT performance, we conducted an extensive ablation study on key components, including Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and Lagrange Multiplier. We hope this work contributes to 1) gaining a comprehensive understanding of the implementation nuances of RFT-based red teaming algorithms, and 2) enabling rapid prototyping of innovative features for RFT-based red teaming. Code for the benchmark can be accessed at this https URL.

Title: GEM: Empowering LLM for both Embedding Generation and Language Understanding

Authors: Caojin Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Jason Liu, Serena Li, Lizhu Zhang, Xiangjun Fan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04344
Pdf URL: https://arxiv.org/pdf/2506.04344
Copy Paste: [[2506.04344]] GEM: Empowering LLM for both Embedding Generation and Language Understanding(https://arxiv.org/abs/2506.04344)
Keywords: generative, large language model
Abstract: Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies in understanding of the query between the embedding model and LLMs. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body, and generates summarization embedding of the text by manipulating the attention mask. This method could be easily integrated into post-training or fine tuning stages of any existing LLMs. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. Our strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance

Title: You Only Train Once

Authors: Christos Sakaridis
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04349
Pdf URL: https://arxiv.org/pdf/2506.04349
Copy Paste: [[2506.04349]] You Only Train Once(https://arxiv.org/abs/2506.04349)
Keywords: segmentation
Abstract: The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of losses selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation which is widely used for optimizing multiple empirical losses simultaneously and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute-force grid search across state-of-the-art networks solving two key problems in computer vision, i.e. 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid-search model on unseen test data. Code will be made publicly available.

Title: HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

Authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04351
Pdf URL: https://arxiv.org/pdf/2506.04351
Copy Paste: [[2506.04351]] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting(https://arxiv.org/abs/2506.04351)
Keywords: diffusion, transformer, generative
Abstract: 3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models or rendering methods like Neural Radiance Fields or Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controlability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that tries to address these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, gender, etc using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model that is conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to the state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.

Title: ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Authors: Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, Pranav Rajpurkar
Subjects: cs.CV, cs.AI, cs.CE, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04353
Pdf URL: https://arxiv.org/pdf/2506.04353
Copy Paste: [[2506.04353]] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding(https://arxiv.org/abs/2506.04353)
Keywords: large language model
Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-rays studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at this https URL

Title: A Risk-Aware Reinforcement Learning Reward for Financial Trading

Authors: Uditansh Srivastava, Shivam Aryan, Shaurya Singh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04358
Pdf URL: https://arxiv.org/pdf/2506.04358
Copy Paste: [[2506.04358]] A Risk-Aware Reinforcement Learning Reward for Financial Trading(https://arxiv.org/abs/2506.04358)
Keywords: robust
Abstract: We propose a novel composite reward function for reinforcement learning in financial trading that balances return and risk using four differentiable terms: annualized return downside risk differential return and the Treynor ratio Unlike single metric objectives for example the Sharpe ratio our formulation is modular and parameterized by weights w1 w2 w3 and w4 enabling practitioners to encode diverse investor preferences We tune these weights via grid search to target specific risk return profiles We derive closed form gradients for each term to facilitate gradient based training and analyze key theoretical properties including monotonicity boundedness and modularity This framework offers a general blueprint for building robust multi objective reward functions in complex trading environments and can be extended with additional risk measures or adaptive weighting

Title: WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

Authors: Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, Pascale Fung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04363
Pdf URL: https://arxiv.org/pdf/2506.04363
Copy Paste: [[2506.04363]] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning(https://arxiv.org/abs/2506.04363)
Keywords: robust, generative
Abstract: Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enable us to evaluate different types of world models and planners and realize a thorough comparison across different hypothesis. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP whereas humans are able to solve both tasks perfectly.

Title: Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

Authors: Zheng-Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04364
Pdf URL: https://arxiv.org/pdf/2506.04364
Copy Paste: [[2506.04364]] Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR(https://arxiv.org/abs/2506.04364)
Keywords: robust
Abstract: To build an automatic speech recognition (ASR) system that can serve everyone in the world, the ASR needs to be robust to a wide range of accents including unseen accents. We systematically study how three different variables in training data -- the number of speakers, the audio duration per each individual speaker, and the diversity of accents -- affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that more speakers enables ASR performance gains from scaling number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count in ASR training data composition for new languages.

Title: Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Authors: Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04367
Pdf URL: https://arxiv.org/pdf/2506.04367
Copy Paste: [[2506.04367]] Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks(https://arxiv.org/abs/2506.04367)
Keywords: robust, transformer
Abstract: Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.

Title: Mechanistic Decomposition of Sentence Representations

Authors: Matthieu Tehenan, Vikram Natarajan, Jonathan Michala, Milton Lin, Juri Opitz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04373
Pdf URL: https://arxiv.org/pdf/2506.04373
Copy Paste: [[2506.04373]] Mechanistic Decomposition of Sentence Representations(https://arxiv.org/abs/2506.04373)
Keywords: interpretability
Abstract: Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.

Title: Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization

Authors: Matthew W. Shinkle, Mark D. Lescroart
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.04379
Pdf URL: https://arxiv.org/pdf/2506.04379
Copy Paste: [[2506.04379]] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization(https://arxiv.org/abs/2506.04379)
Keywords: generative
Abstract: Deep neural networks (DNNs) trained on visual tasks develop feature representations that resemble those in the human visual system. Although DNN-based encoding models can accurately predict brain responses to visual stimuli, they offer limited insight into the specific features driving these responses. Here, we demonstrate that activation maximization -- a technique designed to interpret vision DNNs -- can be applied to DNN-based encoding models of the human brain. We extract and adaptively downsample activations from multiple layers of a pretrained Inception V3 network, then use linear regression to predict fMRI responses. This yields a full image-computable model of brain responses. Next, we apply activation maximization to generate images optimized for predicted responses in individual cortical voxels. We find that these images contain visual characteristics that qualitatively correspond with known selectivity and enable exploration of selectivity across the visual cortex. We further extend our method to whole regions of interest (ROIs) of the brain and validate its efficacy by presenting these images to human participants in an fMRI study. We find that the generated images reliably drive activity in targeted regions across both low- and high-level visual areas and across subjects. These results demonstrate that activation maximization can be successfully applied to DNN-based encoding models. By addressing key limitations of alternative approaches that require natively generative models, our approach enables flexible characterization and modulation of responses across the human visual system.

Title: The Hashed Fractal Key Recovery (HFKR) Problem: From Symbolic Path Inversion to Post-Quantum Cryptographic Keys

Authors: Mohamed Aly Bouke
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.04383
Pdf URL: https://arxiv.org/pdf/2506.04383
Copy Paste: [[2506.04383]] The Hashed Fractal Key Recovery (HFKR) Problem: From Symbolic Path Inversion to Post-Quantum Cryptographic Keys(https://arxiv.org/abs/2506.04383)
Keywords: secure, security, attack, diffusion
Abstract: Classical cryptographic systems rely heavily on structured algebraic problems, such as factorization, discrete logarithms, or lattice-based assumptions, which are increasingly vulnerable to quantum attacks and structural cryptanalysis. In response, this work introduces the Hashed Fractal Key Recovery (HFKR) problem, a non-algebraic cryptographic construction grounded in symbolic dynamics and chaotic perturbations. HFKR builds on the Symbolic Path Inversion Problem (SPIP), leveraging symbolic trajectories generated via contractive affine maps over $\mathbb{Z}^2$, and compressing them into fixed-length cryptographic keys using hash-based obfuscation. A key contribution of this paper is the empirical confirmation that these symbolic paths exhibit fractal behavior, quantified via box counting dimension, path geometry, and spatial density measures. The observed fractal dimension increases with trajectory length and stabilizes near 1.06, indicating symbolic self-similarity and space-filling complexity, both of which reinforce the entropy foundation of the scheme. Experimental results across 250 perturbation trials show that SHA3-512 and SHAKE256 amplify symbolic divergence effectively, achieving mean Hamming distances near 255, ideal bit-flip rates, and negligible entropy deviation. In contrast, BLAKE3 exhibits statistically uniform but weaker diffusion. These findings confirm that HFKR post-quantum security arises from the synergy between symbolic fractality and hash-based entropy amplification. The resulting construction offers a lightweight, structure-free foundation for secure key generation in adversarial settings without relying on algebraic hardness assumptions.

Title: MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP

Authors: Kurt Micallef, Claudia Borg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04385
Pdf URL: https://arxiv.org/pdf/2506.04385
Copy Paste: [[2506.04385]] MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP(https://arxiv.org/abs/2506.04385)
Keywords: generative, large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more "traditional" language modelling approaches.

Title: Through the Stealth Lens: Rethinking Attacks and Defenses in RAG

Authors: Sarthak Choudhary, Nils Palumbo, Ashish Hooda, Krishnamurthy Dj Dvijotham, Somesh Jha
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04390
Pdf URL: https://arxiv.org/pdf/2506.04390
Copy Paste: [[2506.04390]] Through the Stealth Lens: Rethinking Attacks and Defenses in RAG(https://arxiv.org/abs/2506.04390)
Keywords: security, defense, attack, steal
Abstract: Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved set, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize stealth using a distinguishability-based security game. If a few poisoned passages are designed to control the response, they must differentiate themselves from benign ones, inherently compromising stealth. This motivates the need for attackers to rigorously analyze intermediate signals involved in generation$\unicode{x2014}$such as attention patterns or next-token probability distributions$\unicode{x2014}$to avoid easily detectable traces of manipulation. Leveraging attention patterns, we propose a passage-level score$\unicode{x2014}$the Normalized Passage Attention Score$\unicode{x2014}$used by our Attention-Variance Filter algorithm to identify and filter potentially poisoned passages. This method mitigates existing attacks, improving accuracy by up to $\sim 20 \%$ over baseline defenses. To probe the limits of attention-based defenses, we craft stealthier adaptive attacks that obscure such traces, achieving up to $35 \%$ attack success rate, and highlight the challenges in improving stealth.

Title: Is Perturbation-Based Image Protection Disruptive to Image Editing?

Authors: Qiuyu Tang, Bonor Ayambem, Mooi Choo Chuah, Aparna Bharati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04394
Pdf URL: https://arxiv.org/pdf/2506.04394
Copy Paste: [[2506.04394]] Is Perturbation-Based Image Protection Disruptive to Image Editing?(https://arxiv.org/abs/2506.04394)
Keywords: protect, robust, diffusion
Abstract: The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

Title: Normalize Filters! Classical Wisdom for Deep Vision

Authors: Gustavo Perez, Stella X. Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04401
Pdf URL: https://arxiv.org/pdf/2506.04401
Copy Paste: [[2506.04401]] Normalize Filters! Classical Wisdom for Deep Vision(https://arxiv.org/abs/2506.04401)
Keywords: robust, interpretability, transformer
Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

Title: MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04405
Pdf URL: https://arxiv.org/pdf/2506.04405
Copy Paste: [[2506.04405]] MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale(https://arxiv.org/abs/2506.04405)
Keywords: privacy, large language model
Abstract: We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

Title: Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Authors: Wesley Scivetti, Tatsuya Aoyama, Ethan Wilcox, Nathan Schneider
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04408
Pdf URL: https://arxiv.org/pdf/2506.04408
Copy Paste: [[2506.04408]] Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning(https://arxiv.org/abs/2506.04408)
Keywords: transformer
Abstract: Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

Title: HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, David W. Romero
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04421
Pdf URL: https://arxiv.org/pdf/2506.04421
Copy Paste: [[2506.04421]] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation(https://arxiv.org/abs/2506.04421)
Keywords: diffusion
Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.

Title: Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

Authors: Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2506.04430
Pdf URL: https://arxiv.org/pdf/2506.04430
Copy Paste: [[2506.04430]] Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order(https://arxiv.org/abs/2506.04430)
Keywords: large language model
Abstract: Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at this https URL

Title: RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis

Authors: Robin Yadav, Qi Yan, Guy Wolf, Avishek Joey Bose, Renjie Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04439
Pdf URL: https://arxiv.org/pdf/2506.04439
Copy Paste: [[2506.04439]] RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis(https://arxiv.org/abs/2506.04439)
Keywords: generative
Abstract: A fundamental problem in organic chemistry is identifying and predicting the series of reactions that synthesize a desired target product molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction -- i.e. single-step retrosynthesis -- remains challenging even for existing state-of-the-art template-free generative approaches to produce an accurate yet diverse set of feasible reactions. In this paper, we model single-step retrosynthesis planning and introduce RETRO SYNFLOW (RSF) a discrete flow-matching framework that builds a Markov bridge between the prescribed target product molecule and the reactant molecule. In contrast to past approaches, RSF employs a reaction center identification step to produce intermediate structures known as synthons as a more informative source distribution for the discrete flow. To further enhance diversity and feasibility of generated samples, we employ Feynman-Kac steering with Sequential Monte Carlo based resampling to steer promising generations at inference using a new reward oracle that relies on a forward-synthesis model. Empirically, we demonstrate \nameshort achieves $60.0 \%$ top-1 accuracy, which outperforms the previous SOTA by $20 \%$. We also substantiate the benefits of steering at inference and demonstrate that FK-steering improves top-$5$ round-trip accuracy by $19 \%$ over prior template-free SOTA methods, all while preserving competitive top-$k$ accuracy results.

Title: Selective Matching Losses -- Not All Scores Are Created Equal

Authors: Gil I. Shamir, Manfred K. Warmuth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04446
Pdf URL: https://arxiv.org/pdf/2506.04446
Copy Paste: [[2506.04446]] Selective Matching Losses -- Not All Scores Are Created Equal(https://arxiv.org/abs/2506.04446)
Keywords: large language model
Abstract: Learning systems match predicted scores to observations over some domain. Often, it is critical to produce accurate predictions in some subset (or region) of the domain, yet less important to accurately predict in other regions. We construct selective matching loss functions by design of increasing link functions over score domains. A matching loss is an integral over the link. A link defines loss sensitivity as function of the score, emphasizing high slope high sensitivity regions over flat ones. Loss asymmetry drives a model and resolves its underspecification to predict better in high sensitivity regions where it is more important, and to distinguish between high and low importance regions. A large variety of selective scalar losses can be designed with scaled and shifted Sigmoid and hyperbolic sine links. Their properties, however, do not extend to multi-class. Applying them per dimension lacks ranking sensitivity that assigns importance according to class score ranking. Utilizing composite Softmax functions, we develop a framework for multidimensional selective losses. We overcome limitations of the standard Softmax function, that is good for classification, but not for distinction between adjacent scores. Selective losses have substantial advantage over traditional losses in applications with more important score regions, including dwell-time prediction, retrieval, ranking with either pointwise, contrastive pairwise, or listwise losses, distillation problems, and fine-tuning alignment of Large Language Models (LLMs).

Title: Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification

Authors: Payel Bhattacharjee, Fengwei Tian, Ravi Tandon, Joseph Lo, Heidi Hanson, Geoffrey Rubin, Nirav Merchant, John Gounley
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04450
Pdf URL: https://arxiv.org/pdf/2506.04450
Copy Paste: [[2506.04450]] Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification(https://arxiv.org/abs/2506.04450)
Keywords: privacy, protect, large language model
Abstract: Purpose: This study proposes a framework for fine-tuning large language models (LLMs) with differential privacy (DP) to perform multi-abnormality classification on radiology report text. By injecting calibrated noise during fine-tuning, the framework seeks to mitigate the privacy risks associated with sensitive patient data and protect against data leakage while maintaining classification performance. Materials and Methods: We used 50,232 radiology reports from the publicly available MIMIC-CXR chest radiography and CT-RATE computed tomography datasets, collected between 2011 and 2019. Fine-tuning of LLMs was conducted to classify 14 labels from MIMIC-CXR dataset, and 18 labels from CT-RATE dataset using Differentially Private Low-Rank Adaptation (DP-LoRA) in high and moderate privacy regimes (across a range of privacy budgets = {0.01, 0.1, 1.0, 10.0}). Model performance was evaluated using weighted F1 score across three model architectures: BERT-medium, BERT-small, and ALBERT-base. Statistical analyses compared model performance across different privacy levels to quantify the privacy-utility trade-off. Results: We observe a clear privacy-utility trade-off through our experiments on 2 different datasets and 3 different models. Under moderate privacy guarantees the DP fine-tuned models achieved comparable weighted F1 scores of 0.88 on MIMIC-CXR and 0.59 on CT-RATE, compared to non-private LoRA baselines of 0.90 and 0.78, respectively. Conclusion: Differentially private fine-tuning using LoRA enables effective and privacy-preserving multi-abnormality classification from radiology reports, addressing a key challenge in fine-tuning LLMs on sensitive medical data.

Title: Neurosymbolic Artificial Intelligence for Robust Network Intrusion Detection: From Scratch to Transfer Learning

Authors: Huynh T. T. Tran, Jacob Sander, Achraf Cohen, Brian Jalaian, Nathaniel D. Bastian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04454
Pdf URL: https://arxiv.org/pdf/2506.04454
Copy Paste: [[2506.04454]] Neurosymbolic Artificial Intelligence for Robust Network Intrusion Detection: From Scratch to Transfer Learning(https://arxiv.org/abs/2506.04454)
Keywords: security, protect, robust, extraction, interpretability
Abstract: Network Intrusion Detection Systems (NIDS) play a vital role in protecting digital infrastructures against increasingly sophisticated cyber threats. In this paper, we extend ODXU, a Neurosymbolic AI (NSAI) framework that integrates deep embedded clustering for feature extraction, symbolic reasoning using XGBoost, and comprehensive uncertainty quantification (UQ) to enhance robustness, interpretability, and generalization in NIDS. The extended ODXU incorporates score-based methods (e.g., Confidence Scoring, Shannon Entropy) and metamodel-based techniques, including SHAP values and Information Gain, to assess the reliability of predictions. Experimental results on the CIC-IDS-2017 dataset show that ODXU outperforms traditional neural models across six evaluation metrics, including classification accuracy and false omission rate. While transfer learning has seen widespread adoption in fields such as computer vision and natural language processing, its potential in cybersecurity has not been thoroughly explored. To bridge this gap, we develop a transfer learning strategy that enables the reuse of a pre-trained ODXU model on a different dataset. Our ablation study on ACI-IoT-2023 demonstrates that the optimal transfer configuration involves reusing the pre-trained autoencoder, retraining the clustering module, and fine-tuning the XGBoost classifier, and outperforms traditional neural models when trained with as few as 16,000 samples (approximately 50% of the training data). Additionally, results show that metamodel-based UQ methods consistently outperform score-based approaches on both datasets.

Title: Zero-Shot Open-Schema Entity Structure Discovery

Authors: Xueqiang Xu, Jinfeng Xiao, James Barry, Mohab Elkaref, Jiaru Zou, Pengcheng Jiang, Yunyi Zhang, Max Giammona, Geeth de Mel, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04458
Pdf URL: https://arxiv.org/pdf/2506.04458
Copy Paste: [[2506.04458]] Zero-Shot Open-Schema Entity Structure Discovery(https://arxiv.org/abs/2506.04458)
Keywords: extraction, large language model
Abstract: Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs' ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.

Title: Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey

Authors: Ivan Vegner, Sydelle de Souza, Valentin Forch, Martha Lewis, Leonidas A.A. Doumas
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.04461
Pdf URL: https://arxiv.org/pdf/2506.04461
Copy Paste: [[2506.04461]] Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey(https://arxiv.org/abs/2506.04461)
Keywords: interpretability
Abstract: A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley's (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Authors: Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04462
Pdf URL: https://arxiv.org/pdf/2506.04462
Copy Paste: [[2506.04462]] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation(https://arxiv.org/abs/2506.04462)
Keywords: robust, watermark, large language model
Abstract: Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches-Gumbel and KGW-affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.

Title: Aligning Large Language Models with Implicit Preferences from User-Generated Content

Authors: Zhaoxuan Tan, Zheng Li, Tianyi Liu, Haodong Wang, Hyokun Yun, Ming Zeng, Pei Chen, Zhihan Zhang, Yifan Gao, Ruijie Wang, Priyanka Nigam, Bing Yin, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04463
Pdf URL: https://arxiv.org/pdf/2506.04463
Copy Paste: [[2506.04463]] Aligning Large Language Models with Implicit Preferences from User-Generated Content(https://arxiv.org/abs/2506.04463)
Keywords: robust, large language model
Abstract: Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that has the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at this https URL

Title: Classifying Dental Care Providers Through Machine Learning with Features Ranking

Authors: Mohammad Subhi Al-Batah, Mowafaq Salem Alzboon, Muhyeeddin Alqaraleh, Mohammed Hasan Abu-Arqoub, Rashiq Rafiq Marie
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04474
Pdf URL: https://arxiv.org/pdf/2506.04474
Copy Paste: [[2506.04474]] Classifying Dental Care Providers Through Machine Learning with Features Ranking(https://arxiv.org/abs/2506.04474)
Keywords: robust
Abstract: This study investigates the application of machine learning (ML) models for classifying dental providers into two categories - standard rendering providers and safety net clinic (SNC) providers - using a 2018 dataset of 24,300 instances with 20 features. The dataset, characterized by high missing values (38.1%), includes service counts (preventive, treatment, exams), delivery systems (FFS, managed care), and beneficiary demographics. Feature ranking methods such as information gain, Gini index, and ANOVA were employed to identify critical predictors, revealing treatment-related metrics (TXMT_USER_CNT, TXMT_SVC_CNT) as top-ranked features. Twelve ML models, including k-Nearest Neighbors (kNN), Decision Trees, Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), Random Forest, Neural Networks, and Gradient Boosting, were evaluated using 10-fold cross-validation. Classification accuracy was tested across incremental feature subsets derived from rankings. The Neural Network achieved the highest accuracy (94.1%) using all 20 features, followed by Gradient Boosting (93.2%) and Random Forest (93.0%). Models showed improved performance as more features were incorporated, with SGD and ensemble methods demonstrating robustness to missing data. Feature ranking highlighted the dominance of treatment service counts and annotation codes in distinguishing provider types, while demographic variables (AGE_GROUP, CALENDAR_YEAR) had minimal impact. The study underscores the importance of feature selection in enhancing model efficiency and accuracy, particularly in imbalanced healthcare datasets. These findings advocate for integrating feature-ranking techniques with advanced ML algorithms to optimize dental provider classification, enabling targeted resource allocation for underserved populations.

Title: SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL

Authors: Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, Tim Kraska
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04494
Pdf URL: https://arxiv.org/pdf/2506.04494
Copy Paste: [[2506.04494]] SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL(https://arxiv.org/abs/2506.04494)
Keywords: large language model
Abstract: Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLens, an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL. SQLens integrates error signals from both the underlying database and the LLM to identify potential semantic errors within SQL clauses. It further leverages these signals to guide query correction. Empirical results on two public benchmarks show that SQLens outperforms the best LLM-based self-evaluation method by 25.78% in F1 for error detection, and improves execution accuracy of out-of-the-box text-to-SQL systems by up to 20%.

Title: Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization

Authors: Patrik Mesec, Alan Jović
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04496
Pdf URL: https://arxiv.org/pdf/2506.04496
Copy Paste: [[2506.04496]] Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization(https://arxiv.org/abs/2506.04496)
Keywords: extraction
Abstract: Face recognition under extreme head poses is a challenging task. Ideally, a face recognition system should perform well across different head poses, which is known as pose-invariant face recognition. To achieve pose invariance, current approaches rely on sophisticated methods, such as face frontalization and various facial feature extraction model architectures. However, these methods are somewhat impractical in real-life settings and are typically evaluated on small scientific datasets, such as Multi-PIE. In this work, we propose the inverse method of face frontalization, called face defrontalization, to augment the training dataset of facial feature extraction model. The method does not introduce any time overhead during the inference step. The method is composed of: 1) training an adapted face defrontalization FFWM model on a frontal-profile pairs dataset, which has been preprocessed using our proposed face alignment method; 2) training a ResNet-50 facial feature extraction model based on ArcFace loss on a raw and randomly defrontalized large-scale dataset, where defrontalization was performed with our previously trained face defrontalization model. Our method was compared with the existing approaches on four open-access datasets: LFW, AgeDB, CFP, and Multi-PIE. Defrontalization shows improved results compared to models without defrontalization, while the proposed adjustments show clear superiority over the state-of-the-art face frontalization FFWM method on three larger open-access datasets, but not on the small Multi-PIE dataset for extreme poses (75 and 90 degrees). The results suggest that at least some of the current methods may be overfitted to small datasets.

Title: FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Authors: Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04499
Pdf URL: https://arxiv.org/pdf/2506.04499
Copy Paste: [[2506.04499]] FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices(https://arxiv.org/abs/2506.04499)
Keywords: transformer
Abstract: Existing LiDAR 3D object detection methods predominantely rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms and proposed FALO can readily deploy on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6~9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

Title: AuthGuard: Generalizable Deepfake Detection via Language Guidance

Authors: Guangyu Shen, Zhihua Li, Xiang Xu, Tianchen Zhao, Zheng Zhang, Dongsheng An, Zhuowen Tu, Yifan Xing, Qin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04501
Pdf URL: https://arxiv.org/pdf/2506.04501
Copy Paste: [[2506.04501]] AuthGuard: Generalizable Deepfake Detection via Language Guidance(https://arxiv.org/abs/2506.04501)
Keywords: robust
Abstract: Existing deepfake detection techniques struggle to keep-up with the ever-evolving novel, unseen forgeries methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning -- such as recognizing logical inconsistencies and perceptual anomalies -- alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, AuthGuard, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15% on the DFDC dataset and 16.68% on the DF40 dataset. Additionally, AuthGuard significantly enhances deepfake reasoning, improving performance by 24.69% on the DDVQA dataset.

Title: Pruning Everything, Everywhere, All at Once

Authors: Gustavo Henrique do Nascimento, Ian Pons, Anna Helena Reali Costa, Artur Jordao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04513
Pdf URL: https://arxiv.org/pdf/2506.04513
Copy Paste: [[2506.04513]] Pruning Everything, Everywhere, All at Once(https://arxiv.org/abs/2506.04513)
Keywords: robust
Abstract: Deep learning stands as the modern paradigm for solving cognitive tasks. However, as the problem complexity increases, models grow deeper and computationally prohibitive, hindering advancements in real-world and resource-constrained applications. Extensive studies reveal that pruning structures in these models efficiently reduces model complexity and improves computational efficiency. Successful strategies in this sphere include removing neurons (i.e., filters, heads) or layers, but not both together. Therefore, simultaneously pruning different structures remains an open problem. To fill this gap and leverage the benefits of eliminating neurons and layers at once, we propose a new method capable of pruning different structures within a model as follows. Given two candidate subnetworks (pruned models), one from layer pruning and the other from neuron pruning, our method decides which to choose by selecting the one with the highest representation similarity to its parent (the network that generates the subnetworks) using the Centered Kernel Alignment metric. Iteratively repeating this process provides highly sparse models that preserve the original predictive ability. Throughout extensive experiments on standard architectures and benchmarks, we confirm the effectiveness of our approach and show that it outperforms state-of-the-art layer and filter pruning techniques. At high levels of Floating Point Operations reduction, most state-of-the-art methods degrade accuracy, whereas our approach either improves it or experiences only a minimal drop. Notably, on the popular ResNet56 and ResNet110, we achieve a milestone of 86.37% and 95.82% FLOPs reduction. Besides, our pruned models obtain robustness to adversarial and out-of-distribution samples and take an important step towards GreenAI, reducing carbon emissions by up to 83.31%. Overall, we believe our work opens a new chapter in pruning.

Title: DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

Authors: Kun Zhao, Bohao Yang, Chen Tang, Siyuan Dai, Haoteng Tang, Chenghua Lin, Liang Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04516
Pdf URL: https://arxiv.org/pdf/2506.04516
Copy Paste: [[2506.04516]] DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation(https://arxiv.org/abs/2506.04516)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

Title: Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

Authors: Di Wu, Seth Aycock, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04521
Pdf URL: https://arxiv.org/pdf/2506.04521
Copy Paste: [[2506.04521]] Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation(https://arxiv.org/abs/2506.04521)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps.~\textit{Translating Step-by-step}~\citep{briakou2024translating}, for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy's effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models on test; and we show that simply prompting LLMs to ``translate again'' yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT's effectiveness in the context of translation.

Title: Perturbative Gradient Training: A novel training paradigm for bridging the gap between deep neural networks and physical reservoir computing

Authors: Cliff B. Abbott, Mark Elo, Dmytro A. Bozhko
Subjects: cs.LG, cond-mat.mes-hall, cs.ET, cs.NE, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.04523
Pdf URL: https://arxiv.org/pdf/2506.04523
Copy Paste: [[2506.04523]] Perturbative Gradient Training: A novel training paradigm for bridging the gap between deep neural networks and physical reservoir computing(https://arxiv.org/abs/2506.04523)
Keywords: transformer
Abstract: We introduce Perturbative Gradient Training (PGT), a novel training paradigm that overcomes a critical limitation of physical reservoir computing: the inability to perform backpropagation due to the black-box nature of physical reservoirs. Drawing inspiration from perturbation theory in physics, PGT uses random perturbations in the network's parameter space to approximate gradient updates using only forward passes. We demonstrate the feasibility of this approach on both simulated neural network architectures, including a dense network and a transformer model with a reservoir layer, and on experimental hardware using a magnonic auto-oscillation ring as the physical reservoir. Our results show that PGT can achieve performance comparable to that of standard backpropagation methods in cases where backpropagation is impractical or impossible. PGT represents a promising step toward integrating physical reservoirs into deeper neural network architectures and achieving significant energy efficiency gains in AI training.

Title: EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

Authors: Shuo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04526
Pdf URL: https://arxiv.org/pdf/2506.04526
Copy Paste: [[2506.04526]] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention(https://arxiv.org/abs/2506.04526)
Keywords: robust, transformer
Abstract: Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhance image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6\% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5\%. Notably, the EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33\% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.

Title: HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training

Authors: Geon-Woo Kim, Junbo Li, Shashidhar Gandham, Omar Baldonado, Adithya Gangidi, Pavan Balaji, Zhangyang Wang, Aditya Akella
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04531
Pdf URL: https://arxiv.org/pdf/2506.04531
Copy Paste: [[2506.04531]] HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training(https://arxiv.org/abs/2506.04531)
Keywords: large language model
Abstract: Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.

Title: Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction

Authors: Yuanpei Gao, Qi Yan, Yan Leng, Renjie Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04542
Pdf URL: https://arxiv.org/pdf/2506.04542
Copy Paste: [[2506.04542]] Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction(https://arxiv.org/abs/2506.04542)
Keywords: diffusion
Abstract: While deep learning methods have achieved strong performance in time series prediction, their black-box nature and inability to explicitly model underlying stochastic processes often limit their generalization to non-stationary data, especially in the presence of abrupt changes. In this work, we introduce Neural MJD, a neural network based non-stationary Merton jump diffusion (MJD) model. Our model explicitly formulates forecasting as a stochastic differential equation (SDE) simulation problem, combining a time-inhomogeneous Itô diffusion to capture non-stationary stochastic dynamics with a time-inhomogeneous compound Poisson process to model abrupt jumps. To enable tractable learning, we introduce a likelihood truncation mechanism that caps the number of jumps within small time intervals and provide a theoretical error bound for this approximation. Additionally, we propose an Euler-Maruyama with restart solver, which achieves a provably lower error bound in estimating expected states and reduced variance compared to the standard solver. Experiments on both synthetic and real-world datasets demonstrate that Neural MJD consistently outperforms state-of-the-art deep learning and statistical learning methods.

Title: Communication Efficient Adaptive Model-Driven Quantum Federated Learning

Authors: Dev Gurung, Shiva Raj Pokhrel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04548
Pdf URL: https://arxiv.org/pdf/2506.04548
Copy Paste: [[2506.04548]] Communication Efficient Adaptive Model-Driven Quantum Federated Learning(https://arxiv.org/abs/2506.04548)
Keywords: federate
Abstract: Training with huge datasets and a large number of participating devices leads to bottlenecks in federated learning (FL). Furthermore, the challenges of heterogeneity between multiple FL clients affect the overall performance of the system. In a quantum federated learning (QFL) context, we address these three main challenges: i) training bottlenecks from massive datasets, ii) the involvement of a substantial number of devices, and iii) non-IID data distributions. We introduce a model-driven quantum federated learning algorithm (mdQFL) to tackle these challenges. Our proposed approach is efficient and adaptable to various factors, including different numbers of devices. To the best of our knowledge, it is the first to explore training and update personalization, as well as test generalization within a QFL setting, which can be applied to other FL scenarios. We evaluated the efficiency of the proposed mdQFL framework through extensive experiments under diverse non-IID data heterogeneity conditions using various datasets within the Qiskit environment. Our results demonstrate a nearly 50% decrease in total communication costs while maintaining or, in some cases, exceeding the accuracy of the final model and consistently improving local model training compared to the standard QFL baseline. Moreover, our experimental evaluation thoroughly explores the QFL and mdQFL algorithms, along with several influencing factors. In addition, we present a theoretical analysis to clarify the complexities of the proposed algorithm. The experimental code is available at 1.

Title: Unsupervised Machine Learning for Scientific Discovery: Workflow and Best Practices

Authors: Andersen Chang, Tiffany M. Tang, Tarek M. Zikry, Genevera I. Allen
Subjects: cs.LG, stat.AP, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04553
Pdf URL: https://arxiv.org/pdf/2506.04553
Copy Paste: [[2506.04553]] Unsupervised Machine Learning for Scientific Discovery: Workflow and Best Practices(https://arxiv.org/abs/2506.04553)
Keywords: robust
Abstract: Unsupervised machine learning is widely used to mine large, unlabeled datasets to make data-driven discoveries in critical domains such as climate science, biomedicine, astronomy, chemistry, and more. However, despite its widespread utilization, there is a lack of standardization in unsupervised learning workflows for making reliable and reproducible scientific discoveries. In this paper, we present a structured workflow for using unsupervised learning techniques in science. We highlight and discuss best practices starting with formulating validatable scientific questions, conducting robust data preparation and exploration, using a range of modeling techniques, performing rigorous validation by evaluating the stability and generalizability of unsupervised learning conclusions, and promoting effective communication and documentation of results to ensure reproducible scientific discoveries. To illustrate our proposed workflow, we present a case study from astronomy, seeking to refine globular clusters of Milky Way stars based upon their chemical composition. Our case study highlights the importance of validation and illustrates how the benefits of a carefully-designed workflow for unsupervised learning can advance scientific discovery.

Title: BESA: Boosting Encoder Stealing Attack with Perturbation Recovery

Authors: Xuhao Ren, Haotian Liang, Yajie Wang, Chuan Zhang, Zehui Xiong, Liehuang Zhu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04556
Pdf URL: https://arxiv.org/pdf/2506.04556
Copy Paste: [[2506.04556]] BESA: Boosting Encoder Stealing Attack with Perturbation Recovery(https://arxiv.org/abs/2506.04556)
Keywords: defense, attack, steal, generative
Abstract: To boost the encoder stealing attack under the perturbation-based defense that hinders the attack performance, we propose a boosting encoder stealing attack with perturbation recovery named BESA. It aims to overcome perturbation-based defenses. The core of BESA consists of two modules: perturbation detection and perturbation recovery, which can be combined with canonical encoder stealing attacks. The perturbation detection module utilizes the feature vectors obtained from the target encoder to infer the defense mechanism employed by the service provider. Once the defense mechanism is detected, the perturbation recovery module leverages the well-designed generative model to restore a clean feature vector from the perturbed one. Through extensive evaluations based on various datasets, we demonstrate that BESA significantly enhances the surrogate encoder accuracy of existing encoder stealing attacks by up to 24.63\% when facing state-of-the-art defenses and combinations of multiple defenses.

Title: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Authors: Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04559
Pdf URL: https://arxiv.org/pdf/2506.04559
Copy Paste: [[2506.04559]] Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning(https://arxiv.org/abs/2506.04559)
Keywords: large language model
Abstract: Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.

Title: Clustering and Median Aggregation Improve Differentially Private Inference

Authors: Kareem Amin, Salman Avestimehr, Sara Babakniya, Alex Bie, Weiwei Kong, Natalia Ponomareva, Umar Syed
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.04566
Pdf URL: https://arxiv.org/pdf/2506.04566
Copy Paste: [[2506.04566]] Clustering and Median Aggregation Improve Differentially Private Inference(https://arxiv.org/abs/2506.04566)
Keywords: privacy, large language model
Abstract: Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next token statistics by privately computing medians instead of averages. This approach leverages the fact that the median has decreased local sensitivity when next token predictions are similar, allowing us to state a data-dependent and ex-post DP guarantee about the privacy properties of this algorithm. Finally, we demonstrate improvements in terms of representativeness metrics (e.g., MAUVE) as well as downstream task performance. We show that our method produces high-quality synthetic data at significantly lower privacy cost than a previous state-of-the-art method.

Title: StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

Authors: Ranjith Merugu, Bryan Bo Cao, Shubham Jain
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04567
Pdf URL: https://arxiv.org/pdf/2506.04567
Copy Paste: [[2506.04567]] StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation(https://arxiv.org/abs/2506.04567)
Keywords: robust
Abstract: Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner StatsMergeLearner to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation, (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.

Title: Demonstrations of Integrity Attacks in Multi-Agent Systems

Authors: Can Zheng, Yuhan Cao, Xiaoning Dong, Tianxing He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04572
Pdf URL: https://arxiv.org/pdf/2506.04572
Copy Paste: [[2506.04572]] Demonstrations of Integrity Attacks in Multi-Agent Systems(https://arxiv.org/abs/2506.04572)
Keywords: security, attack, robust, large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: \textit{Scapegoater}, who misleads the system monitor to underestimate other agents' contributions; \textit{Boaster}, who misleads the system monitor to overestimate their own performance; \textit{Self-Dealer}, who manipulates other agents to adopt certain tools; and \textit{Free-Rider}, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

Title: Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

Authors: Dimitris Vamvourellis, Dhagash Mehta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04574
Pdf URL: https://arxiv.org/pdf/2506.04574
Copy Paste: [[2506.04574]] Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis(https://arxiv.org/abs/2506.04574)
Keywords: large language model
Abstract: We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.

Title: Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

Authors: Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04575
Pdf URL: https://arxiv.org/pdf/2506.04575
Copy Paste: [[2506.04575]] Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?(https://arxiv.org/abs/2506.04575)
Keywords: large language model
Abstract: Neuro-symbolic approaches combining large language models (LLMs) with solvers excels in logical reasoning problems need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Then reliable symbolic solvers return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through **logic-invariant lexical diversification**. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at this https URL.

Title: Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching

Authors: Jianfei Zhang, Bei Li, Jun Bai, Rumei Li, Yanmeng Wang, Chenghua Lin, Wenge Rong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04579
Pdf URL: https://arxiv.org/pdf/2506.04579
Copy Paste: [[2506.04579]] Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching(https://arxiv.org/abs/2506.04579)
Keywords: large language model
Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.

Title: SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

Authors: Hongjun Liu, Yilun Zhao, Arman Cohan, Chen Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04583
Pdf URL: https://arxiv.org/pdf/2506.04583
Copy Paste: [[2506.04583]] SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing(https://arxiv.org/abs/2506.04583)
Keywords: segmentation
Abstract: Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.

Title: LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Authors: Wen Ding, Fan Qian
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04586
Pdf URL: https://arxiv.org/pdf/2506.04586
Copy Paste: [[2506.04586]] LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models(https://arxiv.org/abs/2506.04586)
Keywords: large language model
Abstract: We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.

Title: Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

Authors: Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04590
Pdf URL: https://arxiv.org/pdf/2506.04590
Copy Paste: [[2506.04590]] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting(https://arxiv.org/abs/2506.04590)
Keywords: robust, generative
Abstract: We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite masks dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.

Title: Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Authors: Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04592
Pdf URL: https://arxiv.org/pdf/2506.04592
Copy Paste: [[2506.04592]] Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification(https://arxiv.org/abs/2506.04592)
Keywords: robust, large language model
Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.

Title: Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Authors: Marianna Nezhurina, Tomer Porian, Giovanni Pucceti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04598
Pdf URL: https://arxiv.org/pdf/2506.04598
Copy Paste: [[2506.04598]] Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets(https://arxiv.org/abs/2506.04598)
Keywords: robust, generative, segmentation
Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws provides thus means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the road for systematic comparison and improvement of open foundation models and datasets for their creation. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at this https URL.

Title: A MISMATCHED Benchmark for Scientific Natural Language Inference

Authors: Firoz Shaik, Mobashir Sadat, Nikita Gautam, Doina Caragea, Cornelia Caragea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04603
Pdf URL: https://arxiv.org/pdf/2506.04603
Copy Paste: [[2506.04603]] A MISMATCHED Benchmark for Scientific Natural Language Inference(https://arxiv.org/abs/2506.04603)
Keywords: large language model
Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains-PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17% illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.

Title: SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

Authors: Alexander Huang-Menders, Xinhang Liu, Andy Xu, Yuyao Zhang, Chi-Keung Tang, Yu-Wing Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04606
Pdf URL: https://arxiv.org/pdf/2506.04606
Copy Paste: [[2506.04606]] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents(https://arxiv.org/abs/2506.04606)
Keywords: diffusion
Abstract: SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.

Title: Ignoring Directionality Leads to Compromised Graph Neural Network Explanations

Authors: Changsheng Sun, Xinke Li, Jin Song Dong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04608
Pdf URL: https://arxiv.org/pdf/2506.04608
Copy Paste: [[2506.04608]] Ignoring Directionality Leads to Compromised Graph Neural Network Explanations(https://arxiv.org/abs/2506.04608)
Keywords: security, explainability
Abstract: Graph Neural Networks (GNNs) are increasingly used in critical domains, where reliable explanations are vital for supporting human decision-making. However, the common practice of graph symmetrization discards directional information, leading to significant information loss and misleading explanations. Our analysis demonstrates how this practice compromises explanation fidelity. Through theoretical and empirical studies, we show that preserving directional semantics significantly improves explanation quality, ensuring more faithful insights for human decision-makers. These findings highlight the need for direction-aware GNN explainability in security-critical applications.

Title: Exploring bidirectional bounds for minimax-training of Energy-based models

Authors: Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, Søren Hauberg
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04609
Pdf URL: https://arxiv.org/pdf/2506.04609
Copy Paste: [[2506.04609]] Exploring bidirectional bounds for minimax-training of Energy-based models(https://arxiv.org/abs/2506.04609)
Keywords: diffusion, generative
Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate, the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

Title: Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

Authors: Ho-Lam Chung, Teng-Yun Hsiao, Hsiao-Ying Huang, Chunerh Cho, Jian-Ren Lin, Zhang Ziwei, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04611
Pdf URL: https://arxiv.org/pdf/2506.04611
Copy Paste: [[2506.04611]] Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning(https://arxiv.org/abs/2506.04611)
Keywords: generative, large language model
Abstract: Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.

Title: Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth

Authors: Jinyoung Jun, Lei Chu, Jiahao Li, Yan Lu, Chang-Su Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04612
Pdf URL: https://arxiv.org/pdf/2506.04612
Copy Paste: [[2506.04612]] Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth(https://arxiv.org/abs/2506.04612)
Keywords: robust, diffusion
Abstract: We propose a novel two-stage framework for sensor depth enhancement, called Perfecting Depth. This framework leverages the stochastic nature of diffusion models to automatically detect unreliable depth regions while preserving geometric cues. In the first stage (stochastic estimation), the method identifies unreliable measurements and infers geometric structure by leveraging a training-inference domain gap. In the second stage (deterministic refinement), it enforces structural consistency and pixel-level accuracy using the uncertainty map derived from the first stage. By combining stochastic uncertainty modeling with deterministic refinement, our method yields dense, artifact-free depth maps with improved reliability. Experimental results demonstrate its effectiveness across diverse real-world scenarios. Furthermore, theoretical analysis, various experiments, and qualitative visualizations validate its robustness and scalability. Our framework sets a new baseline for sensor depth enhancement, with potential applications in autonomous driving, robotics, and immersive technologies.

Title: Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Authors: Likun Cao, Rui Pan, James Evans
Subjects: cs.CL, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04616
Pdf URL: https://arxiv.org/pdf/2506.04616
Copy Paste: [[2506.04616]] Subjective Perspectives within Learned Representations Predict High-Impact Innovation(https://arxiv.org/abs/2506.04616)
Keywords: large language model
Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in future. When perspective and background diversity are decomposed as the angular difference between collaborators' perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various perspective and background diversity, which are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.

Title: Deep Learning Reforms Image Matching: A Survey and Outlook

Authors: Shihua Zhang, Zizhuo Li, Kaining Zhang, Yifan Lu, Yuxin Deng, Linfeng Tang, Xingyu Jiang, Jiayi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04619
Pdf URL: https://arxiv.org/pdf/2506.04619
Copy Paste: [[2506.04619]] Deep Learning Reforms Image Matching: A Survey and Outlook(https://arxiv.org/abs/2506.04619)
Keywords: robust
Abstract: Image matching, which establishes correspondences between two-view images to recover 3D structure and camera geometry, serves as a cornerstone in computer vision and underpins a wide range of applications, including visual localization, 3D reconstruction, and simultaneous localization and mapping (SLAM). Traditional pipelines composed of ``detector-descriptor, feature matcher, outlier filter, and geometric estimator'' falter in challenging scenarios. Recent deep-learning advances have significantly boosted both robustness and accuracy. This survey adopts a unique perspective by comprehensively reviewing how deep learning has incrementally transformed the classical image matching pipeline. Our taxonomy highly aligns with the traditional pipeline in two key aspects: i) the replacement of individual steps in the traditional pipeline with learnable alternatives, including learnable detector-descriptor, outlier filter, and geometric estimator; and ii) the merging of multiple steps into end-to-end learnable modules, encompassing middle-end sparse matcher, end-to-end semi-dense/dense matcher, and pose regressor. We first examine the design principles, advantages, and limitations of both aspects, and then benchmark representative methods on relative pose recovery, homography estimation, and visual localization tasks. Finally, we discuss open challenges and outline promising directions for future research. By systematically categorizing and evaluating deep learning-driven strategies, this survey offers a clear overview of the evolving image matching landscape and highlights key avenues for further innovation.

Title: Static Word Embeddings for Sentence Semantic Representation

Authors: Takashi Wada, Yuki Hirakawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04624
Pdf URL: https://arxiv.org/pdf/2506.04624
Copy Paste: [[2506.04624]] Static Word Embeddings for Sentence Semantic Representation(https://arxiv.org/abs/2506.04624)
Keywords: transformer
Abstract: We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even rivals a basic Sentence Transformer model (SimCSE) on some data sets. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are irrelevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.

Title: Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Authors: Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04625
Pdf URL: https://arxiv.org/pdf/2506.04625
Copy Paste: [[2506.04625]] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning(https://arxiv.org/abs/2506.04625)
Keywords: large language model
Abstract: Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.

Title: Composing Agents to Minimize Worst-case Risk

Authors: Guruprerana Shabadi, Rajeev Alur
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04632
Pdf URL: https://arxiv.org/pdf/2506.04632
Copy Paste: [[2506.04632]] Composing Agents to Minimize Worst-case Risk(https://arxiv.org/abs/2506.04632)
Keywords: privacy, fair
Abstract: From software development to robot control, modern agentic systems decompose complex objectives into a sequence of subtasks and choose a set of specialized AI agents to complete them. We formalize an agentic workflow as a directed acyclic graph, called an agent graph, where edges represent AI agents and paths correspond to feasible compositions of agents. When deploying these systems in the real world, we need to choose compositions of agents that not only maximize the task success, but also minimize risk where the risk captures requirements like safety, fairness, and privacy. This additionally requires carefully analyzing the low-probability (tail) behaviors of compositions of agents. In this work, we consider worst-case risk minimization over the set of feasible agent compositions. We define worst-case risk as the tail quantile -- also known as value-at-risk -- of the loss distribution of the agent composition where the loss quantifies the risk associated with agent behaviors. We introduce an efficient algorithm that traverses the agent graph and finds a near-optimal composition of agents by approximating the value-at-risk via a union bound and dynamic programming. Furthermore, we prove that the approximation is near-optimal asymptotically for a broad class of practical loss functions. To evaluate our framework, we consider a suite of video game-like control benchmarks that require composing several agents trained with reinforcement learning and demonstrate our algorithm's effectiveness in approximating the value-at-risk and identifying the optimal agent composition.

Title: Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Authors: Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04633
Pdf URL: https://arxiv.org/pdf/2506.04633
Copy Paste: [[2506.04633]] Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations(https://arxiv.org/abs/2506.04633)
Keywords: large language model
Abstract: Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE(Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, significantly speeding up (down by 7.5 seconds on average) with intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

Title: Incentivizing Collaborative Breach Detection

Authors: Mridu Nanda, Michael K. Reiter
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.04634
Pdf URL: https://arxiv.org/pdf/2506.04634
Copy Paste: [[2506.04634]] Incentivizing Collaborative Breach Detection(https://arxiv.org/abs/2506.04634)
Keywords: attack
Abstract: Decoy passwords, or "honeywords," alert a site to its breach if they are ever entered in a login attempt on that site. However, an attacker can identify a user-chosen password from among the decoys, without risk of alerting the site to its breach, by performing credential stuffing, i.e., entering the stolen passwords at another site where the same user reused her password. Prior work has thus proposed that sites monitor for the entry of their honeywords at other sites. Unfortunately, it is not clear what incentives sites have to participate in this monitoring. In this paper we propose and evaluate an algorithm by which sites can exchange monitoring favors. Through a model-checking analysis, we show that using our algorithm, a site improves its ability to detect its own breach when it increases the monitoring effort it expends for other sites. We additionally quantify the impacts of various parameters on detection effectiveness and their implications for the deployment of a system to support a monitoring ecosystem. Finally, we evaluate our algorithm on a real dataset of breached credentials and provide a performance analysis that confirms its scalability and practical viability.

Title: ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

Authors: Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04635
Pdf URL: https://arxiv.org/pdf/2506.04635
Copy Paste: [[2506.04635]] ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition(https://arxiv.org/abs/2506.04635)
Keywords: robust
Abstract: Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

Title: Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

Authors: Qiming Hu, Linlong Fan, Yiyan Luo, Yuhang Yu, Xiaojie Guo, Qingnan Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04641
Pdf URL: https://arxiv.org/pdf/2506.04641
Copy Paste: [[2506.04641]] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders(https://arxiv.org/abs/2506.04641)
Keywords: diffusion, generative, segmentation
Abstract: The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at \href{this https URL}{here}.

Title: TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Authors: Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04642
Pdf URL: https://arxiv.org/pdf/2506.04642
Copy Paste: [[2506.04642]] TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering(https://arxiv.org/abs/2506.04642)
Keywords: transformer, large language model
Abstract: The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and a mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chain of thoughts.

Title: Authenticated Private Set Intersection: A Merkle Tree-Based Approach for Enhancing Data Integrity

Authors: Zixian Gong, Zhiyong Zheng, Zhe Hu, Kun Tian, Yi Zhang, Zhedanov Oleksiy, Fengxia Liu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.04647
Pdf URL: https://arxiv.org/pdf/2506.04647
Copy Paste: [[2506.04647]] Authenticated Private Set Intersection: A Merkle Tree-Based Approach for Enhancing Data Integrity(https://arxiv.org/abs/2506.04647)
Keywords: secure, privacy, attack
Abstract: Private Set Intersection (PSI) enables secure computation of set intersections while preserving participant privacy, standard PSI existing protocols remain vulnerable to data integrity attacks allowing malicious participants to extract additional intersection information or mislead other parties. In this paper, we propose the definition of data integrity in PSI and construct two authenticated PSI schemes by integrating Merkle Trees with state-of-the-art two-party volePSI and multi-party mPSI protocols. The resulting two-party authenticated PSI achieves communication complexity $\mathcal{O}(n \lambda+n \log n)$, aligning with the best-known unauthenticated PSI schemes, while the multi-party construction is $\mathcal{O}(n \kappa+n \log n)$ which introduces additional overhead due to Merkle tree inclusion proofs. Due to the incorporation of integrity verification, our authenticated schemes incur higher costs compared to state-of-the-art unauthenticated schemes. We also provide efficient implementations of our protocols and discuss potential improvements, including alternative authentication blocks.

Title: FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Authors: Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04648
Pdf URL: https://arxiv.org/pdf/2506.04648
Copy Paste: [[2506.04648]] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion(https://arxiv.org/abs/2506.04648)
Keywords: diffusion, generative
Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint this http URL introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution-without sacrificing generation quality.

Title: Feature-Based Lie Group Transformer for Real-World Applications

Authors: Takayuki Komatsu, Yoshiyuki Ohmura, Kayato Nishitsunoi, Yasuo Kuniyoshi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04668
Pdf URL: https://arxiv.org/pdf/2506.04668
Copy Paste: [[2506.04668]] Feature-Based Lie Group Transformer for Real-World Applications(https://arxiv.org/abs/2506.04668)
Keywords: extraction, transformer, segmentation
Abstract: The main goal of representation learning is to acquire meaningful representations from real-world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes is a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel-to-pixel translation without feature extraction, and can only process low-resolution images with no background, which prevents real-world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real-world object and background. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

Title: FedAPM: Federated Learning via ADMM with Partial Model Personalization

Authors: Shengkun Zhu, Feiteng Nie, Jinshan Zeng, Sheng Wang, Yuan Sun, Yuan Yao, Shangfeng Chen, Quanqing Xu, Chuanhui Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04672
Pdf URL: https://arxiv.org/pdf/2506.04672
Copy Paste: [[2506.04672]] FedAPM: Federated Learning via ADMM with Partial Model Personalization(https://arxiv.org/abs/2506.04672)
Keywords: federate
Abstract: In federated learning (FL), the assumption that datasets from different devices are independent and identically distributed (i.i.d.) often does not hold due to user differences, and the presence of various data modalities across clients makes using a single model impractical. Personalizing certain parts of the model can effectively address these issues by allowing those parts to differ across clients, while the remaining parts serve as a shared model. However, we found that partial model personalization may exacerbate client drift (each client's local model diverges from the shared model), thereby reducing the effectiveness and efficiency of FL algorithms. We propose an FL framework based on the alternating direction method of multipliers (ADMM), referred to as FedAPM, to mitigate client drift. We construct the augmented Lagrangian function by incorporating first-order and second-order proximal terms into the objective, with the second-order term providing fixed correction and the first-order term offering compensatory correction between the local and shared models. Our analysis demonstrates that FedAPM, by using explicit estimates of the Lagrange multiplier, is more stable and efficient in terms of convergence compared to other FL frameworks. We establish the global convergence of FedAPM training from arbitrary initial points to a stationary point, achieving three types of rates: constant, linear, and sublinear, under mild assumptions. We conduct experiments using four heterogeneous and multimodal datasets with different metrics to validate the performance of FedAPM. Specifically, FedAPM achieves faster and more accurate convergence, outperforming the SOTA methods with average improvements of 12.3% in test accuracy, 16.4% in F1 score, and 18.0% in AUC while requiring fewer communication rounds.

Title: Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts

Authors: Zhong Ji, Rongshuai Wei, Jingren Liu, Yanwei Pang, Jungong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04673
Pdf URL: https://arxiv.org/pdf/2506.04673
Copy Paste: [[2506.04673]] Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts(https://arxiv.org/abs/2506.04673)
Keywords: interpretability
Abstract: Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal this http URL address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL this http URL, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation this http URL addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data this http URL, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision this http URL results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot this http URL findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.

Title: Gen-n-Val: Agentic Image Data Generation and Validation

Authors: Jing-En Huang, I-Sheng Fang, Tzuhsuan Huang, Chih-Yu Wang, Jun-Cheng Chen
Subjects: cs.CV, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2506.04676
Pdf URL: https://arxiv.org/pdf/2506.04676
Copy Paste: [[2506.04676]] Gen-n-Val: Agentic Image Data Generation and Validation(https://arxiv.org/abs/2506.04676)
Keywords: diffusion, large language model, segmentation
Abstract: Recently, Large Language Models (LLMs) and Vision Large Language Models (VLLMs) have demonstrated impressive performance as agents across various tasks while data scarcity and label noise remain significant challenges in computer vision tasks, such as object detection and instance segmentation. A common solution for resolving these issues is to generate synthetic data. However, current synthetic data generation methods struggle with issues, such as multiple objects per mask, inaccurate segmentation, and incorrect category labels, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), LLMs, and VLLMs to produce high-quality, single-object masks and diverse backgrounds. Gen-n-Val consists of two agents: (1) The LD prompt agent, an LLM, optimizes prompts for LD to generate high-quality foreground instance images and segmentation masks. These optimized prompts ensure the generation of single-object synthetic data with precise instance masks and clean backgrounds. (2) The data validation agent, a VLLM, which filters out low-quality synthetic instance images. The system prompts for both agents are refined through TextGrad. Additionally, we use image harmonization to combine multiple instances within scenes. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 1% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7. 1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val improves the performance of YOLOv9 and YOLO11 families in instance segmentation and object detection.

Title: The cost of ensembling: is it always worth combining?

Authors: Marco Zanotti
Subjects: cs.LG, stat.AP, stat.OT
Abstract URL: https://arxiv.org/abs/2506.04677
Pdf URL: https://arxiv.org/pdf/2506.04677
Copy Paste: [[2506.04677]] The cost of ensembling: is it always worth combining?(https://arxiv.org/abs/2506.04677)
Keywords: robust
Abstract: Given the continuous increase in dataset sizes and the complexity of forecasting models, the trade-off between forecast accuracy and computational cost is emerging as an extremely relevant topic, especially in the context of ensemble learning for time series forecasting. To asses it, we evaluated ten base models and eight ensemble configurations across two large-scale retail datasets (M5 and VN1), considering both point and probabilistic accuracy under varying retraining frequencies. We showed that ensembles consistently improve forecasting performance, particularly in probabilistic settings. However, these gains come at a substantial computational cost, especially for larger, accuracy-driven ensembles. We found that reducing retraining frequency significantly lowers costs, with minimal impact on accuracy, particularly for point forecasts. Moreover, efficiency-driven ensembles offer a strong balance, achieving competitive accuracy with considerably lower costs compared to accuracy-optimized combinations. Most importantly, small ensembles of two or three models are often sufficient to achieve near-optimal results. These findings provide practical guidelines for deploying scalable and cost-efficient forecasting systems, supporting the broader goals of sustainable AI in forecasting. Overall, this work shows that careful ensemble design and retraining strategy selection can yield accurate, robust, and cost-effective forecasts suitable for real-world applications.

Title: Normative Conflicts and Shallow AI Alignment

Authors: Raphaël Millière
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04679
Pdf URL: https://arxiv.org/pdf/2506.04679
Copy Paste: [[2506.04679]] Normative Conflicts and Shallow AI Alignment(https://arxiv.org/abs/2506.04679)
Keywords: attack, robust, large language model
Abstract: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing from on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This ``shallow alignment'' problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

Title: Urania: Differentially Private Insights into AI Use

Authors: Daogao Liu, Edith Cohen, Badih Ghazi, Peter Kairouz, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Adam Sealfon, Da Yu, Chiyuan Zhang
Subjects: cs.LG, cs.AI, cs.CL, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2506.04681
Pdf URL: https://arxiv.org/pdf/2506.04681
Copy Paste: [[2506.04681]] Urania: Differentially Private Insights into AI Use(https://arxiv.org/abs/2506.04681)
Keywords: privacy, protect, robust, extraction
Abstract: We introduce $Urania$, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, $Urania$ provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.

Title: MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

Authors: Chuyun Deng, Na Liu, Wei Xie, Lianming Xu, Li Wang
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2506.04682
Pdf URL: https://arxiv.org/pdf/2506.04682
Copy Paste: [[2506.04682]] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements(https://arxiv.org/abs/2506.04682)
Keywords: extraction, transformer
Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.

Title: MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models

Authors: Gio Paik, Geewook Kim, Jinbae Im
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04688
Pdf URL: https://arxiv.org/pdf/2506.04688
Copy Paste: [[2506.04688]] MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models(https://arxiv.org/abs/2506.04688)
Keywords: robust, large language model
Abstract: This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at this https URL.

Title: Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Authors: Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04689
Pdf URL: https://arxiv.org/pdf/2506.04689
Copy Paste: [[2506.04689]] Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models(https://arxiv.org/abs/2506.04689)
Keywords: extraction, large language model
Abstract: Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts lead to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.

Title: Towards Better Generalization via Distributional Input Projection Network

Authors: Yifan Hao, Yanxin Lu, Xinwei Shen, Tong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04690
Pdf URL: https://arxiv.org/pdf/2506.04690
Copy Paste: [[2506.04690]] Towards Better Generalization via Distributional Input Projection Network(https://arxiv.org/abs/2506.04690)
Keywords: attack, transformer, large language model
Abstract: As overparameterized models become increasingly prevalent, training loss alone offers limited insight into generalization performance. While smoothness has been linked to improved generalization across various settings, directly enforcing smoothness in neural networks remains challenging. To address this, we introduce Distributional Input Projection Networks (DIPNet), a novel framework that projects inputs into learnable distributions at each layer. This distributional representation induces a smoother loss landscape with respect to the input, promoting better generalization. We provide theoretical analysis showing that DIPNet reduces both local smoothness measures and the Lipschitz constant of the network, contributing to improved generalization performance. Empirically, we validate DIPNet across a wide range of architectures and tasks, including Vision Transformers (ViTs), Large Language Models (LLMs), ResNet and MLPs. Our method consistently enhances test performance under standard settings, adversarial attacks, out-of-distribution inputs, and reasoning benchmarks. We demonstrate that the proposed input projection strategy can be seamlessly integrated into existing models, providing a general and effective approach for boosting generalization performance in modern deep learning.

Title: Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

Authors: Lu Wei, Liangzhi Li, Tong Xiang, Xiao Liu, Noa Garcia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04693
Pdf URL: https://arxiv.org/pdf/2506.04693
Copy Paste: [[2506.04693]] Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification(https://arxiv.org/abs/2506.04693)
Keywords: large language model
Abstract: The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

Title: Influence Functions for Edge Edits in Non-Convex Graph Neural Networks

Authors: Jaeseung Heo, Kyeongheung Yun, Seokwon Yoon, MoonJeong Park, Jungseul Ok, Dongwoo Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04694
Pdf URL: https://arxiv.org/pdf/2506.04694
Copy Paste: [[2506.04694]] Influence Functions for Edge Edits in Non-Convex Graph Neural Networks(https://arxiv.org/abs/2506.04694)
Keywords: attack, robust, interpretability
Abstract: Understanding how individual edges influence the behavior of graph neural networks (GNNs) is essential for improving their interpretability and robustness. Graph influence functions have emerged as promising tools to efficiently estimate the effects of edge deletions without retraining. However, existing influence prediction methods rely on strict convexity assumptions, exclusively consider the influence of edge deletions while disregarding edge insertions, and fail to capture changes in message propagation caused by these modifications. In this work, we propose a proximal Bregman response function specifically tailored for GNNs, relaxing the convexity requirement and enabling accurate influence prediction for standard neural network architectures. Furthermore, our method explicitly accounts for message propagation effects and extends influence prediction to both edge deletions and insertions in a principled way. Experiments with real-world datasets demonstrate accurate influence predictions for different characteristics of GNNs. We further demonstrate that the influence function is versatile in applications such as graph rewiring and adversarial attacks.

Title: Enhanced Drought Analysis in Bangladesh: A Machine Learning Approach for Severity Classification Using Satellite Data

Authors: Tonmoy Paul, Mrittika Devi Mati, Md. Mahmudul Islam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04696
Pdf URL: https://arxiv.org/pdf/2506.04696
Copy Paste: [[2506.04696]] Enhanced Drought Analysis in Bangladesh: A Machine Learning Approach for Severity Classification Using Satellite Data(https://arxiv.org/abs/2506.04696)
Keywords: security
Abstract: Drought poses a pervasive environmental challenge in Bangladesh, impacting agriculture, socio-economic stability, and food security due to its unique geographic and anthropogenic vulnerabilities. Traditional drought indices, such as the Standardized Precipitation Index (SPI) and Palmer Drought Severity Index (PDSI), often overlook crucial factors like soil moisture and temperature, limiting their resolution. Moreover, current machine learning models applied to drought prediction have been underexplored in the context of Bangladesh, lacking a comprehensive integration of satellite data across multiple districts. To address these gaps, we propose a satellite data-driven machine learning framework to classify drought across 38 districts of Bangladesh. Using unsupervised algorithms like K-means and Bayesian Gaussian Mixture for clustering, followed by classification models such as KNN, Random Forest, Decision Tree, and Naive Bayes, the framework integrates weather data (humidity, soil moisture, temperature) from 2012-2024. This approach successfully classifies drought severity into different levels. However, it shows significant variabilities in drought vulnerabilities across regions which highlights the aptitude of machine learning models in terms of identifying and predicting drought conditions.

Title: Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence

Authors: José Manuel de Frutos, Manuel A. Vázquez, Pablo M. Olmos, Joaquín Míguez
Subjects: cs.LG, cs.AI, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04700
Pdf URL: https://arxiv.org/pdf/2506.04700
Copy Paste: [[2506.04700]] Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence(https://arxiv.org/abs/2506.04700)
Keywords: robust, generative
Abstract: Rank-based statistical metrics, such as the invariant statistical loss (ISL), have recently emerged as robust and practically effective tools for training implicit generative models. In this work, we introduce dual-ISL, a novel likelihood-free objective for training implicit generative models that interchanges the roles of the target and model distributions in the ISL framework, yielding a convex optimization problem in the space of model densities. We prove that the resulting rank-based discrepancy $d_K$ is i) continuous under weak convergence and with respect to the $L^1$ norm, and ii) convex in its first argument-properties not shared by classical divergences such as KL or Wasserstein distances. Building on this, we develop a theoretical framework that interprets $d_K$ as an $L^2$-projection of the density ratio $q = p/\tilde p$ onto a Bernstein polynomial basis, from which we derive exact bounds on the truncation error, precise convergence rates, and a closed-form expression for the truncated density approximation. We further extend our analysis to the multivariate setting via random one-dimensional projections, defining a sliced dual-ISL divergence that retains both convexity and continuity. We empirically show that these theoretical advantages translate into practical ones. Specifically, across several benchmarks dual-ISL converges more rapidly, delivers markedly smoother and more stable training, and more effectively prevents mode collapse than classical ISL and other leading implicit generative methods-while also providing an explicit density approximation.

Title: HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04704
Pdf URL: https://arxiv.org/pdf/2506.04704
Copy Paste: [[2506.04704]] HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model(https://arxiv.org/abs/2506.04704)
Keywords: attack, robust
Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

Title: UNO: Unlearning via Orthogonalization in Generative models

Authors: Pinak Mandal, Georg A. Gottwald
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04712
Pdf URL: https://arxiv.org/pdf/2506.04712
Copy Paste: [[2506.04712]] UNO: Unlearning via Orthogonalization in Generative models(https://arxiv.org/abs/2506.04712)
Keywords: privacy, generative
Abstract: As generative models become increasingly powerful and pervasive, the ability to unlearn specific data, whether due to privacy concerns, legal requirements, or the correction of harmful content, has become increasingly important. Unlike in conventional training, where data are accumulated and knowledge is reinforced, unlearning aims to selectively remove the influence of particular data points without costly retraining from scratch. To be effective and reliable, such algorithms need to achieve (i) forgetting of the undesired data, (ii) preservation of the quality of the generation, (iii) preservation of the influence of the desired training data on the model parameters, and (iv) small number of training steps. We propose fast unlearning algorithms based on loss gradient orthogonalization. We show that our algorithms are able to forget data while maintaining the fidelity of the original model. Using MNIST and CelebA data, we demonstrate that our algorithms achieve orders of magnitude faster unlearning times than their predecessors, such as gradient surgery.

Title: Robust Few-Shot Vision-Language Model Adaptation

Authors: Hanxin Wang, Tian Liu, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04713
Pdf URL: https://arxiv.org/pdf/2506.04713
Copy Paste: [[2506.04713]] Robust Few-Shot Vision-Language Model Adaptation(https://arxiv.org/abs/2506.04713)
Keywords: robust
Abstract: Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with proper hyperparameters significantly outperforms the popular VLM adaptation methods prompt tuning and linear probing; (2) visual encoder-only finetuning achieves better efficiency and accuracy than contrastively finetuning both visual and textual encoders; (3) finetuning the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered with two simple augmentation techniques: (1) retrieval augmentation which retrieves task-relevant data from the VLM's pretraining dataset to enhance adaptation, and (2) adversarial perturbation which promotes robustness during finetuning. Results show that the former/latter boosts OOD/ID accuracy while slightly sacrificing the ID/OOD accuracy. Yet, perhaps understandably, naively combining the two does not maintain their best OOD/ID accuracy. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partial finetuning the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves the state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.

Title: Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

Authors: Zelu Qi, Ping Shi, Chaoyang Zhang, Shuqi Wang, Fei Zhao, Da Pan, Zefeng Ying
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04715
Pdf URL: https://arxiv.org/pdf/2506.04715
Copy Paste: [[2506.04715]] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model(https://arxiv.org/abs/2506.04715)
Keywords: generative, large language model
Abstract: The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce a LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved \textbf{second place} in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at this https URL.

Title: Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

Authors: Hongyu Wang, Yonghao Long, Yueyao Chen, Hon-Chi Yip, Markus Scheppach, Philip Wai-Yan Chiu, Yeung Yam, Helen Mei-Ling Meng, Qi Dou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04716
Pdf URL: https://arxiv.org/pdf/2506.04716
Copy Paste: [[2506.04716]] Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion(https://arxiv.org/abs/2506.04716)
Keywords: robust, diffusion
Abstract: Endoscopic Submucosal Dissection (ESD) is a well-established technique for removing epithelial lesions. Predicting dissection trajectories in ESD videos offers significant potential for enhancing surgical skill training and simplifying the learning process, yet this area remains underexplored. While imitation learning has shown promise in acquiring skills from expert demonstrations, challenges persist in handling uncertain future movements, learning geometric symmetries, and generalizing to diverse surgical scenarios. To address these, we introduce a novel approach: Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE). Our method models expert behavior through a joint state action distribution, capturing the stochastic nature of dissection trajectories and enabling robust visual representation learning across various endoscopic views. By incorporating a diffusion model into policy learning, iDPOE ensures efficient training and sampling, leading to more accurate predictions and better generalization. Additionally, we enhance the model's ability to generalize to geometric symmetries by embedding equivariance into the learning process. To address state mismatches, we develop a forward-process guided action inference strategy for conditional sampling. Using an ESD video dataset of nearly 2000 clips, experimental results show that our approach surpasses state-of-the-art methods, both explicit and implicit, in trajectory prediction. To the best of our knowledge, this is the first application of imitation learning to surgical skill development for dissection trajectory prediction.

Title: Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

Authors: Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Yibo Zhang, Zhenyu Guan, Chaozhuo Li, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04739
Pdf URL: https://arxiv.org/pdf/2506.04739
Copy Paste: [[2506.04739]] Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection(https://arxiv.org/abs/2506.04739)
Keywords: robust, large language model
Abstract: The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C$^2$EFND) framework to address these challenges. The C$^2$EFND framework strategically leverages both LLMs' generalization power and SLMs' classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continue learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C$^2$EFND significantly outperforms existed methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.

Title: SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Authors: Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Aishan Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04743
Pdf URL: https://arxiv.org/pdf/2506.04743
Copy Paste: [[2506.04743]] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs(https://arxiv.org/abs/2506.04743)
Keywords: defense, attack, robust, steal, generative
Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations-such as local pixel triggers or global semantic phrases-into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.

Title: Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models

Authors: Fei Ding, Baiqiao Wang, Zijian Zeng, Youwei Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04746
Pdf URL: https://arxiv.org/pdf/2506.04746
Copy Paste: [[2506.04746]] Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models(https://arxiv.org/abs/2506.04746)
Keywords: large language model
Abstract: The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate supervision in GRPO frequently leads to inefficient exploration dynamics. A single error in a complex reasoning chain can invalidate the entire solution, resulting in abrupt reward vanishing and compromising training this http URL address these challenges, we propose MGRPO (Multi-layer GRPO). MGRPO operates in two layers: the first layer employs standard GRPO to generate an initial response. This response, along with the original query, is then fed into a second-layer GRPO process. This second layer is specifically trained to identify and correct errors in the initial response, effectively creating a self-correction loop. This mechanism provides implicit process-level supervision by rewarding successful error correction, without requiring an explicit, densely-annotated reward model. Experimental results on several mathematical reasoning benchmarks demonstrate that MGRPO significantly outperforms standard GRPO, achieving superior performance by fostering both reasoning and self-correction abilities.

Title: Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Authors: Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.04755
Pdf URL: https://arxiv.org/pdf/2506.04755
Copy Paste: [[2506.04755]] Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning(https://arxiv.org/abs/2506.04755)
Keywords: robust, large language model
Abstract: While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning by two complementary estimators: 1) Causal Discrepancy Estimator (CDE) based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at this https URL.

Title: Log-Linear Attention

Authors: Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04761
Pdf URL: https://arxiv.org/pdf/2506.04761
Copy Paste: [[2506.04761]] Log-Linear Attention(https://arxiv.org/abs/2506.04761)
Keywords: transformer
Abstract: The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time variants.

Title: HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

Authors: Suhan Woo, Seongwon Lee, Jinwoo Jang, Euntai Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04764
Pdf URL: https://arxiv.org/pdf/2506.04764
Copy Paste: [[2506.04764]] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition(https://arxiv.org/abs/2506.04764)
Keywords: robust
Abstract: When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy, optimally balancing speed and accuracy to ensure robust matching, even between descriptors from different image types. This approach enables HypeVPR to outperform state-of-the-art methods while significantly reducing retrieval time, achieving up to 5x faster retrieval across diverse benchmark datasets. The code and models will be released at this https URL.

Title: OpenGT: A Comprehensive Benchmark For Graph Transformers

Authors: Jiachen Tang, Zhonghao Wang, Sirui Chen, Sheng Zhou, Jiawei Chen, Jiajun Bu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04765
Pdf URL: https://arxiv.org/pdf/2506.04765
Copy Paste: [[2506.04765]] OpenGT: A Comprehensive Benchmark For Graph Transformers(https://arxiv.org/abs/2506.04765)
Keywords: fair, transformer
Abstract: Graph Transformers (GTs) have recently demonstrated remarkable performance across diverse domains. By leveraging attention mechanisms, GTs are capable of modeling long-range dependencies and complex structural relationships beyond local neighborhoods. However, their applicable scenarios are still underexplored, this highlights the need to identify when and why they excel. Furthermore, unlike GNNs, which predominantly rely on message-passing mechanisms, GTs exhibit a diverse design space in areas such as positional encoding, attention mechanisms, and graph-specific adaptations. Yet, it remains unclear which of these design choices are truly effective and under what conditions. As a result, the community currently lacks a comprehensive benchmark and library to promote a deeper understanding and further development of GTs. To address this gap, this paper introduces OpenGT, a comprehensive benchmark for Graph Transformers. OpenGT enables fair comparisons and multidimensional analysis by establishing standardized experimental settings and incorporating a broad selection of state-of-the-art GNNs and GTs. Our benchmark evaluates GTs from multiple perspectives, encompassing diverse tasks and datasets with varying properties. Through extensive experiments, our benchmark has uncovered several critical insights, including the difficulty of transferring models across task levels, the limitations of local attention, the efficiency trade-offs in several models, the application scenarios of specific positional encodings, and the preprocessing overhead of some positional encodings. We aspire for this work to establish a foundation for future graph transformer research emphasizing fairness, reproducibility, and generalizability. We have developed an easy-to-use library OpenGT for training and evaluating existing GTs. The benchmark code is available at this https URL.

Title: Fine-Grained Interpretation of Political Opinions in Large Language Models

Authors: Jingyu Hu, Mengyue Yang, Mengnan Du, Weiru Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04774
Pdf URL: https://arxiv.org/pdf/2506.04774
Copy Paste: [[2506.04774]] Fine-Grained Interpretation of Political Opinions in Large Language Models(https://arxiv.org/abs/2506.04774)
Keywords: robust, large language model
Abstract: Studies of LLMs' political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs' responses and their internal intentions. This motivates us to probe LLMs' internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs' political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis to multi-dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention Experiments show these vectors can intervene in LLMs to generate responses with different political leanings.

Title: MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2506.04779
Pdf URL: https://arxiv.org/pdf/2506.04779
Copy Paste: [[2506.04779]] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark(https://arxiv.org/abs/2506.04779)
Keywords: large language model
Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at this https URL. Evaluation Code is available at this https URL.

Title: Kernel $k$-Medoids as General Vector Quantization

Authors: Thore Gerlach, Sascha Mücke, Christian Bauckhage
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2506.04786
Pdf URL: https://arxiv.org/pdf/2506.04786
Copy Paste: [[2506.04786]] Kernel $k$-Medoids as General Vector Quantization(https://arxiv.org/abs/2506.04786)
Keywords: interpretability
Abstract: Vector Quantization (VQ) is a widely used technique in machine learning and data compression, valued for its simplicity and interpretability. Among hard VQ methods, $k$-medoids clustering and Kernel Density Estimation (KDE) approaches represent two prominent yet seemingly unrelated paradigms -- one distance-based, the other rooted in probability density matching. In this paper, we investigate their connection through the lens of Quadratic Unconstrained Binary Optimization (QUBO). We compare a heuristic QUBO formulation for $k$-medoids, which balances centrality and diversity, with a principled QUBO derived from minimizing Maximum Mean Discrepancy in KDE-based VQ. Surprisingly, we show that the KDE-QUBO is a special case of the $k$-medoids-QUBO under mild assumptions on the kernel's feature map. This reveals a deeper structural relationship between these two approaches and provides new insight into the geometric interpretation of the weighting parameters used in QUBO formulations for VQ.

Title: Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

Authors: Jisu An, Junseok Lee, Jeoungeun Lee, Yongseok Son
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04788
Pdf URL: https://arxiv.org/pdf/2506.04788
Copy Paste: [[2506.04788]] Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques(https://arxiv.org/abs/2506.04788)
Keywords: robust, large language model
Abstract: The rapid progress of Multimodal Large Language Models(MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.

Title: MULTISS: un protocole de stockage confidentiel {à} long terme sur plusieurs r{é}seaux QKD

Authors: Thomas Prévost (I3S), Olivier Alibart (INPHYNI), Marc Kaplan, Anne Marin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.04800
Pdf URL: https://arxiv.org/pdf/2506.04800
Copy Paste: [[2506.04800]] MULTISS: un protocole de stockage confidentiel {à} long terme sur plusieurs r{é}seaux QKD(https://arxiv.org/abs/2506.04800)
Keywords: secure, security
Abstract: This paper presents MULTISS, a new protocol for long-term storage distributed across multiple Quantum Key Distribution (QKD) networks. This protocol is an extension of LINCOS, a secure storage protocol that uses Shamir secret sharing for secret storage on a single QKD network. Our protocol uses hierarchical secret sharing to distribute a secret across multiple QKD networks while ensuring perfect security. Our protocol further allows for sharing updates without having to reconstruct the entire secret. We also prove that MULTISS is strictly more secure than LINCOS, which remains vulnerable when its QKD network is compromised.

Title: SupeRANSAC: One RANSAC to Rule Them All

Authors: Daniel Barath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04803
Pdf URL: https://arxiv.org/pdf/2506.04803
Copy Paste: [[2506.04803]] SupeRANSAC: One RANSAC to Rule Them All(https://arxiv.org/abs/2506.04803)
Keywords: robust
Abstract: Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC's apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the "bells and whistles" (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: this https URL

Title: Adaptive Preconditioners Trigger Loss Spikes in Adam

Authors: Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04805
Pdf URL: https://arxiv.org/pdf/2506.04805
Copy Paste: [[2506.04805]] Adaptive Preconditioners Trigger Loss Spikes in Adam(https://arxiv.org/abs/2506.04805)
Keywords: transformer
Abstract: Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $\beta_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold $2/\eta$ for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds $2/\eta$. We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.

Title: Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Authors: Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2506.04810
Pdf URL: https://arxiv.org/pdf/2506.04810
Copy Paste: [[2506.04810]] Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study(https://arxiv.org/abs/2506.04810)
Keywords: large language model
Abstract: Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

Title: Design of intelligent proofreading system for English translation based on CNN and BERT

Authors: Feijun Liu, Huifeng Wang, Kun Wang, Yizhen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04811
Pdf URL: https://arxiv.org/pdf/2506.04811
Copy Paste: [[2506.04811]] Design of intelligent proofreading system for English translation based on CNN and BERT(https://arxiv.org/abs/2506.04811)
Keywords: robust, transformer
Abstract: Since automatic translations can contain errors that require substantial human post-editing, machine translation proofreading is essential for improving quality. This paper proposes a novel hybrid approach for robust proofreading that combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). In order to extract semantic information from phrases and expressions, CNN uses a variety of convolution kernel filters to capture local n-gram patterns. In the meanwhile, BERT creates context-rich representations of whole sequences by utilizing stacked bidirectional transformer encoders. Using BERT's attention processes, the integrated error detection component relates tokens to spot translation irregularities including word order problems and omissions. The correction module then uses parallel English-German alignment and GRU decoder models in conjunction with translation memory to propose logical modifications that maintain original meaning. A unified end-to-end training process optimized for post-editing performance is applied to the whole pipeline. The multi-domain collection of WMT and the conversational dialogues of Open-Subtitles are two of the English-German parallel corpora used to train the model. Multiple loss functions supervise detection and correction capabilities. Experiments attain a 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall. Comparative benchmarking demonstrates state-of-the-art performance in identifying and coherently rectifying mistranslations and omissions.

Title: Spike-TBR: a Noise Resilient Neuromorphic Event Representation

Authors: Gabriele Magrini. Federico Becattini, Luca Cultrera, Lorenzo Berlincioni, Pietro Pala, Alberto Del Bimbo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04817
Pdf URL: https://arxiv.org/pdf/2506.04817
Copy Paste: [[2506.04817]] Spike-TBR: a Noise Resilient Neuromorphic Event Representation(https://arxiv.org/abs/2506.04817)
Keywords: robust
Abstract: Event cameras offer significant advantages over traditional frame-based sensors, including higher temporal resolution, lower latency and dynamic range. However, efficiently converting event streams into formats compatible with standard computer vision pipelines remains a challenging problem, particularly in the presence of noise. In this paper, we propose Spike-TBR, a novel event-based encoding strategy based on Temporal Binary Representation (TBR), addressing its vulnerability to noise by integrating spiking neurons. Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, creating a more robust representation of event streams. We evaluate four variants of Spike-TBR, each using different spiking neurons, across multiple datasets, demonstrating superior performance in noise-affected scenarios while improving the results on clean data. Our method bridges the gap between spike-based and frame-based processing, offering a simple noise-resilient solution for event-driven vision applications.

Title: LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning

Authors: Zhen Hao Wong, Jingwen Deng, Runming He, Zirong Chen, Qijie You, Hejun Dong, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04821
Pdf URL: https://arxiv.org/pdf/2506.04821
Copy Paste: [[2506.04821]] LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning(https://arxiv.org/abs/2506.04821)
Keywords: robust, large language model
Abstract: Large language models (LLMs) excel at many supervised tasks but often struggle with structured reasoning in unfamiliar settings. This discrepancy suggests that standard fine-tuning pipelines may instill narrow, domain-specific heuristics rather than fostering general-purpose thinking strategies. In this work, we propose a "play to learn" framework that fine-tunes LLMs through reinforcement learning on a suite of seven custom logic puzzles, each designed to cultivate distinct reasoning skills such as constraint propagation, spatial consistency, and symbolic deduction. Using a reinforcement learning setup with verifiable rewards, models receive binary feedback based on puzzle correctness, encouraging iterative, hypothesis-driven problem solving. We demonstrate that this training approach significantly improves out-of-distribution performance on a range of mathematical benchmarks, especially for mid-difficulty problems that require multi-step reasoning. Analyses across problem categories and difficulty levels reveal that puzzle training promotes transferable reasoning routines, strengthening algebraic manipulation, geometric inference, and combinatorial logic, while offering limited gains on rote or highly specialized tasks. These findings show that reinforcement learning over logic puzzles reshapes the internal reasoning of LLMs, enabling more robust and compositional generalization without relying on task-specific symbolic tools.

Title: Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Authors: Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04822
Pdf URL: https://arxiv.org/pdf/2506.04822
Copy Paste: [[2506.04822]] Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms(https://arxiv.org/abs/2506.04822)
Keywords: large language model
Abstract: Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.

Title: Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors

Authors: Svetlana Pavlitska, Jamie Robb, Nikolai Polley, Melih Yazgan, J. Marius Zöllner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04823
Pdf URL: https://arxiv.org/pdf/2506.04823
Copy Paste: [[2506.04823]] Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors(https://arxiv.org/abs/2506.04823)
Keywords: attack
Abstract: Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in the lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at this https URL.

Title: DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

Authors: Shuo Cao, Yihao Liu, Xiaohui Li.Yuanting Gao.Yu Zhou, Chao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04830
Pdf URL: https://arxiv.org/pdf/2506.04830
Copy Paste: [[2506.04830]] DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation(https://arxiv.org/abs/2506.04830)
Keywords: extraction, transformer
Abstract: Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial$\times$temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR task.

Title: Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04832
Pdf URL: https://arxiv.org/pdf/2506.04832
Copy Paste: [[2506.04832]] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models(https://arxiv.org/abs/2506.04832)
Keywords: robust, large language model
Abstract: Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: this https URL.

Title: OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model

Authors: Kunshen Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04837
Pdf URL: https://arxiv.org/pdf/2506.04837
Copy Paste: [[2506.04837]] OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model(https://arxiv.org/abs/2506.04837)
Keywords: large language model, segmentation
Abstract: Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, a LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.

Title: On Automating Security Policies with Contemporary LLMs

Authors: Pablo Fernández Saura, K. R. Jayaram, Vatche Isahagian, Jorge Bernal Bernabé, Antonio Skarmeta
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04838
Pdf URL: https://arxiv.org/pdf/2506.04838
Copy Paste: [[2506.04838]] On Automating Security Policies with Contemporary LLMs(https://arxiv.org/abs/2506.04838)
Keywords: security, attack, robust, large language model
Abstract: The complexity of modern computing environments and the growing sophistication of cyber threats necessitate a more robust, adaptive, and automated approach to security enforcement. In this paper, we present a framework leveraging large language models (LLMs) for automating attack mitigation policy compliance through an innovative combination of in-context learning and retrieval-augmented generation (RAG). We begin by describing how our system collects and manages both tool and API specifications, storing them in a vector database to enable efficient retrieval of relevant information. We then detail the architectural pipeline that first decomposes high-level mitigation policies into discrete tasks and subsequently translates each task into a set of actionable API calls. Our empirical evaluation, conducted using publicly available CTI policies in STIXv2 format and Windows API documentation, demonstrates significant improvements in precision, recall, and F1-score when employing RAG compared to a non-RAG baseline.

Title: Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

Authors: Giorgio Biancini, Alessio Ferrato, Carla Limongelli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04851
Pdf URL: https://arxiv.org/pdf/2506.04851
Copy Paste: [[2506.04851]] Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights(https://arxiv.org/abs/2506.04851)
Keywords: large language model
Abstract: Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there are still space for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs - Llama 2, Mistral, and GPT-3.5 - to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM, but we inject the knowledge into the prompt to contrast the hallucinations, giving the educators control over the test's source text, too. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.

Title: A Private Smart Wallet with Probabilistic Compliance

Authors: Andrea Rizzini, Marco Esposito, Francesco Bruschi, Donatella Sciuto
Subjects: cs.CR, cs.CE
Abstract URL: https://arxiv.org/abs/2506.04853
Pdf URL: https://arxiv.org/pdf/2506.04853
Copy Paste: [[2506.04853]] A Private Smart Wallet with Probabilistic Compliance(https://arxiv.org/abs/2506.04853)
Keywords: privacy
Abstract: We propose a privacy-preserving smart wallet with a novel invitation-based private onboarding mechanism. The solution integrates two levels of compliance in concert with an authority party: a proof of innocence mechanism and an ancestral commitment tracking system using bloom filters for probabilistic UTXO chain states. Performance analysis demonstrates practical efficiency: private transfers with compliance checks complete within seconds on a consumer-grade laptop, and overall with proof generation remaining low. On-chain costs stay minimal, ensuring affordability for all operations on Base layer 2 network. The wallet facilitates private contact list management through encrypted data blobs while maintaining transaction unlinkability. Our evaluation validates the approach's viability for privacy-preserving, compliance-aware digital payments with minimized computational and financial overhead.

Title: Prompting LLMs: Length Control for Isometric Machine Translation

Authors: Dávid Javorský, Ondřej Bojar, François Yvon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04855
Pdf URL: https://arxiv.org/pdf/2506.04855
Copy Paste: [[2506.04855]] Prompting LLMs: Length Control for Isometric Machine Translation(https://arxiv.org/abs/2506.04855)
Keywords: large language model
Abstract: In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs allows to notably improve overall tradeoff between the length and quality, yielding state-of-the-art performance for some language pairs.

Title: Sparse Autoencoders, Again?

Authors: Yin Lu, Tong He, Xuening Zhu, David Wipf
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04859
Pdf URL: https://arxiv.org/pdf/2506.04859
Copy Paste: [[2506.04859]] Sparse Autoencoders, Again?(https://arxiv.org/abs/2506.04859)
Keywords: diffusion, large language model
Abstract: Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In general, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable, within domains such as images and language model activation patterns.

Title: Geological Field Restoration through the Lens of Image Inpainting

Authors: Vladislav Trifonov, Ivan Oseledets, Ekaterina Muravleva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04869
Pdf URL: https://arxiv.org/pdf/2506.04869
Copy Paste: [[2506.04869]] Geological Field Restoration through the Lens of Image Inpainting(https://arxiv.org/abs/2506.04869)
Keywords: robust
Abstract: We present a new viewpoint on a reconstructing multidimensional geological fields from sparse observations. Drawing inspiration from deterministic image inpainting techniques, we model a partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. Our approach combines ideas from tensor completion and geostatistics, providing a robust optimization framework. Experiments on synthetic geological fields demonstrate that used tensor completion method significant improvements in reconstruction accuracy over ordinary kriging for various percent of observed data.

Title: There Was Never a Bottleneck in Concept Bottleneck Models

Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Alfonso Ortega
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04877
Pdf URL: https://arxiv.org/pdf/2506.04877
Copy Paste: [[2506.04877]] There Was Never a Bottleneck in Concept Bottleneck Models(https://arxiv.org/abs/2506.04877)
Keywords: interpretability
Abstract: Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs support concept-level interventions with theoretical guarantees, remain consistent with Bayesian principles, and offer greater flexibility in key design choices.

Title: Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

Authors: Yu-Feng Chen, Tzuhsuan Huang, Pin-Yen Chiu, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04879
Pdf URL: https://arxiv.org/pdf/2506.04879
Copy Paste: [[2506.04879]] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking(https://arxiv.org/abs/2506.04879)
Keywords: attack, watermark, diffusion
Abstract: Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis results of the watermark characteristics in term of backdoor attack further support the effectiveness of our approach. The code is available at:this https URL

Title: Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Authors: Wenxi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04887
Pdf URL: https://arxiv.org/pdf/2506.04887
Copy Paste: [[2506.04887]] Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies(https://arxiv.org/abs/2506.04887)
Keywords: large language model
Abstract: Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assesses if UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and $F_1$ scores, with average gains of 3.85\% and 6.08\% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated to the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

Title: Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study

Authors: Andrew Hamara, Greg Hamerly, Pablo Rivas, Andrew C. Freeman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04892
Pdf URL: https://arxiv.org/pdf/2506.04892
Copy Paste: [[2506.04892]] Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study(https://arxiv.org/abs/2506.04892)
Keywords: transformer
Abstract: Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at this https URL.

Title: ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Authors: Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04894
Pdf URL: https://arxiv.org/pdf/2506.04894
Copy Paste: [[2506.04894]] ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests(https://arxiv.org/abs/2506.04894)
Keywords: robust, large language model
Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose \textbf{ICPC-Eval}, a top-level competitive coding benchmark designed to probing the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: this https URL

Title: From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Authors: Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04897
Pdf URL: https://arxiv.org/pdf/2506.04897
Copy Paste: [[2506.04897]] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes(https://arxiv.org/abs/2506.04897)
Keywords: large language model
Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best performance model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scene beyond object-level semantics.

Title: Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots

Authors: Alex Pan, Mary-Anne Williams
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04907
Pdf URL: https://arxiv.org/pdf/2506.04907
Copy Paste: [[2506.04907]] Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots(https://arxiv.org/abs/2506.04907)
Keywords: large language model
Abstract: Large Language Models (LLMs), whilst great at extracting facts from text, struggle with nested narrative reasoning. Existing long context and multi-hop QA benchmarks inadequately test this, lacking realistic distractors or failing to decouple context length from reasoning complexity, masking a fundamental LLM limitation. We introduce Verbose ListOps, a novel benchmark that programmatically transposes ListOps computations into lengthy, coherent stories. This uniquely forces internal computation and state management of nested reasoning problems by withholding intermediate results, and offers fine-grained controls for both narrative size \emph{and} reasoning difficulty. Whilst benchmarks like LongReason (2025) advance approaches for synthetically expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints a specific LLM vulnerability: difficulty in state management for nested sub-reasoning amongst semantically-relevant, distracting narrative. Our experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse in performance on Verbose ListOps at modest (~10k token) narrative lengths, despite effortlessly solving raw ListOps equations. Addressing this failure is paramount for real-world text interpretation which requires identifying key reasoning points, tracking conceptual intermediate results, and filtering irrelevant information. Verbose ListOps, and its extensible generation framework thus enables targeted reasoning enhancements beyond mere context-window expansion; a critical step to automating the world's knowledge work.

Title: Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

Authors: Filip Slezak, Magnus K. Gjerde, Joakim B. Haurum, Ivan Nikolov, Morten S. Laursen, Thomas B. Moeslund
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04908
Pdf URL: https://arxiv.org/pdf/2506.04908
Copy Paste: [[2506.04908]] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer(https://arxiv.org/abs/2506.04908)
Keywords: robust
Abstract: In this paper, we introduce a 3D Gaussian Splatting (3DGS)-based pipeline for stereo dataset generation, offering an efficient alternative to Neural Radiance Fields (NeRF)-based methods. To obtain useful geometry estimates, we explore utilizing the reconstructed geometry from the explicit 3D representations as well as depth estimates from the FoundationStereo model in an expert knowledge transfer setup. We find that when fine-tuning stereo models on 3DGS-generated datasets, we demonstrate competitive performance in zero-shot generalization benchmarks. When using the reconstructed geometry directly, we observe that it is often noisy and contains artifacts, which propagate noise to the trained model. In contrast, we find that the disparity estimates from FoundationStereo are cleaner and consequently result in a better performance on the zero-shot generalization benchmarks. Our method highlights the potential for low-cost, high-fidelity dataset creation and fast fine-tuning for deep stereo models. Moreover, we also reveal that while the latest Gaussian Splatting based methods have achieved superior performance on established benchmarks, their robustness falls short in challenging in-the-wild settings warranting further exploration.

Title: Dissecting Long Reasoning Models: An Empirical Study

Authors: Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.04913
Pdf URL: https://arxiv.org/pdf/2506.04913
Copy Paste: [[2506.04913]] Dissecting Long Reasoning Models: An Empirical Study(https://arxiv.org/abs/2506.04913)
Keywords: robust
Abstract: Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.

Title: Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

Authors: Junior Cedric Tonga, KV Aditya Srivatsa, Kaushal Kumar Maurya, Fajri Koto, Ekaterina Kochmar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04920
Pdf URL: https://arxiv.org/pdf/2506.04920
Copy Paste: [[2506.04920]] Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback(https://arxiv.org/abs/2506.04920)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student's native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.

Title: Predicting ICU In-Hospital Mortality Using Adaptive Transformer Layer Fusion

Authors: Han Wang, Ruoyun He, Guoguang Lao, Ting Liu, Hejiao Luo, Changqi Qin, Hongying Luo, Junmin Huang, Zihan Wei, Lu Chen, Yongzhi Xu, Ziqian Bi, Junhao Song, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Huafeng Liu, Junfeng Hao, Chunjie Tian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04924
Pdf URL: https://arxiv.org/pdf/2506.04924
Copy Paste: [[2506.04924]] Predicting ICU In-Hospital Mortality Using Adaptive Transformer Layer Fusion(https://arxiv.org/abs/2506.04924)
Keywords: robust, transformer
Abstract: Early identification of high-risk ICU patients is crucial for directing limited medical resources. We introduce ALFIA (Adaptive Layer Fusion with Intelligent Attention), a modular, attention-based architecture that jointly trains LoRA (Low-Rank Adaptation) adapters and an adaptive layer-weighting mechanism to fuse multi-layer semantic features from a BERT backbone. Trained on our rigorous cw-24 (CriticalWindow-24) benchmark, ALFIA surpasses state-of-the-art tabular classifiers in AUPRC while preserving a balanced precision-recall profile. The embeddings produced by ALFIA's fusion module, capturing both fine-grained clinical cues and high-level concepts, enable seamless pairing with GBDTs (CatBoost/LightGBM) as ALFIA-boost, and deep neuro networks as ALFIA-nn, yielding additional performance gains. Our experiments confirm ALFIA's superior early-warning performance, by operating directly on routine clinical text, it furnishes clinicians with a convenient yet robust tool for risk stratification and timely intervention in critical-care settings.

Title: ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Authors: Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04929
Pdf URL: https://arxiv.org/pdf/2506.04929
Copy Paste: [[2506.04929]] ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT(https://arxiv.org/abs/2506.04929)
Keywords: transformer
Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

Title: CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx

Authors: Lukas Picek, Elisa Belotti, Michal Bojda, Ludek Bufka, Vojtech Cermak, Martin Dula, Rostislav Dvorak, Luboslav Hrdy, Miroslav Jirik, Vaclav Kocourek, Josefa Krausova, Jirı Labuda, Jakub Straka, Ludek Toman, Vlado Trulık, Martin Vana, Miroslav Kutal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04931
Pdf URL: https://arxiv.org/pdf/2506.04931
Copy Paste: [[2506.04931]] CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx(https://arxiv.org/abs/2506.04931)
Keywords: diffusion, segmentation
Abstract: We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx includes more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 219 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: Southwest Bohemia and the Western Carpathians. To increase the data variability, we create a complementary synthetic set with more than 100k photorealistic images generated via a Unity-based pipeline and diffusion-driven text-to-texture modeling, covering diverse environments, poses, and coat-pattern variations. To allow testing generalization across spatial and temporal domains, we define three tailored evaluation protocols/splits: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set. This dataset is targeted to be instrumental in benchmarking state-of-the-art models and the development of novel methods for not just individual animal re-identification.

Title: Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

Authors: Yong Sun, Yipeng Wang, Junyu Shi, Zhiyuan Zhang, Yanmei Xiao, Lei Zhu, Manxi Jiang, Qiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04950
Pdf URL: https://arxiv.org/pdf/2506.04950
Copy Paste: [[2506.04950]] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining(https://arxiv.org/abs/2506.04950)
Keywords: transformer
Abstract: Artificial intelligence has recently shown promise in automated embryo selection for In-Vitro Fertilization (IVF). However, current approaches either address partial embryo evaluation lacking holistic quality assessment or target clinical outcomes inevitably confounded by extra-embryonic factors, both limiting clinical utility. To bridge this gap, we propose a new task called Video-Based Embryo Grading - the first paradigm that directly utilizes full-length time-lapse monitoring (TLM) videos to predict embryologists' overall quality assessments. To support this task, we curate a real-world clinical dataset comprising over 2,500 TLM videos, each annotated with a grading label indicating the overall quality of embryos. Grounded in clinical decision-making principles, we propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that conceptually replicates embryologists' evaluation process. The CoSTeM comprises two branches: (1) a morphological branch using a Mixture of Cross-Attentive Experts layer and a Temporal Selection Block to select discriminative local structural features, and (2) a morphokinetic branch employing a Temporal Transformer to model global developmental trajectories, synergistically integrating static and dynamic determinants for grading embryos. Extensive experimental results demonstrate the superiority of our design. This work provides a valuable methodological framework for AI-assisted embryo selection. The dataset and source code will be publicly available upon acceptance.

Title: Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

Authors: Igor Meleshin, Anna Chistyakova, Anastasia Antsiferova, Dmitriy Vatolin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04951
Pdf URL: https://arxiv.org/pdf/2506.04951
Copy Paste: [[2506.04951]] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations(https://arxiv.org/abs/2506.04951)
Keywords: defense, attack, robust
Abstract: Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems -- from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses -- adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations -- and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.

Title: APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Authors: Hong Gao, Yiming Bao, Xuezhan Tu, Bin Zhong, Minling Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04953
Pdf URL: https://arxiv.org/pdf/2506.04953
Copy Paste: [[2506.04953]] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval(https://arxiv.org/abs/2506.04953)
Keywords: extraction, large language model
Abstract: Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.

Title: FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Authors: Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04956
Pdf URL: https://arxiv.org/pdf/2506.04956
Copy Paste: [[2506.04956]] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation(https://arxiv.org/abs/2506.04956)
Keywords: transformer
Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23\% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at this https URL.

Title: PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages

Authors: Deniz Simsek, Aryaz Eghbali, Michael Pradel
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.04962
Pdf URL: https://arxiv.org/pdf/2506.04962
Copy Paste: [[2506.04962]] PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages(https://arxiv.org/abs/2506.04962)
Keywords: security, large language model
Abstract: Security vulnerabilities in software packages are a significant concern for developers and users alike. Patching these vulnerabilities in a timely manner is crucial to restoring the integrity and security of software systems. However, previous work has shown that vulnerability reports often lack proof-of-concept (PoC) exploits, which are essential for fixing the vulnerability, testing patches, and avoiding regressions. Creating a PoC exploit is challenging because vulnerability reports are informal and often incomplete, and because it requires a detailed understanding of how inputs passed to potentially vulnerable APIs may reach security-relevant sinks. In this paper, we present PoCGen, a novel approach to autonomously generate and validate PoC exploits for vulnerabilities in npm packages. This is the first fully autonomous approach to use large language models (LLMs) in tandem with static and dynamic analysis techniques for PoC exploit generation. PoCGen leverages an LLM for understanding vulnerability reports, for generating candidate PoC exploits, and for validating and refining them. Our approach successfully generates exploits for 77% of the vulnerabilities in the this http URL dataset and 39% in a new, more challenging dataset of 794 recent vulnerabilities. This success rate significantly outperforms a recent baseline (by 45 absolute percentage points), while imposing an average cost of $0.02 per generated exploit.

Title: Hiding in Plain Sight: Query Obfuscation via Random Multilingual Searches

Authors: Anton Firc, Jan Klusáček, Kamil Malinka
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.04963
Pdf URL: https://arxiv.org/pdf/2506.04963
Copy Paste: [[2506.04963]] Hiding in Plain Sight: Query Obfuscation via Random Multilingual Searches(https://arxiv.org/abs/2506.04963)
Keywords: privacy, protect
Abstract: Modern search engines extensively personalize results by building detailed user profiles based on query history and behaviour. While personalization can enhance relevance, it introduces privacy risks and can lead to filter bubbles. This paper proposes and evaluates a lightweight, client-side query obfuscation strategy using randomly generated multilingual search queries to disrupt user profiling. Through controlled experiments on the this http URL search engine, we assess the impact of interleaving real queries with obfuscating noise in various language configurations and ratios. Our findings show that while displayed search results remain largely stable, the search engine's identified user interests shift significantly under obfuscation. We further demonstrate that such random queries can prevent accurate profiling and overwrite established user profiles. This study provides practical evidence for query obfuscation as a viable privacy-preserving mechanism and introduces a tool that enables users to autonomously protect their search behaviour without modifying existing infrastructure.

Title: From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

Authors: Adrian Marius Dumitran, Theodor-Pierre Moroianu, Vasile Paul Alexe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04965
Pdf URL: https://arxiv.org/pdf/2506.04965
Copy Paste: [[2506.04965]] From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation(https://arxiv.org/abs/2506.04965)
Keywords: robust, generative, large language model
Abstract: This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs' problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.

Title: Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

Authors: Mélisande Teng, Arthur Ouaknine, Etienne Laliberté, Yoshua Bengio, David Rolnick, Hugo Larochelle
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04970
Pdf URL: https://arxiv.org/pdf/2506.04970
Copy Paste: [[2506.04970]] Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery(https://arxiv.org/abs/2506.04970)
Keywords: segmentation
Abstract: Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad-scale. Large pre-trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts. However, efficiently tuning SAM end-to-end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

Title: Evaluating the Impact of Privacy-Preserving Federated Learning on CAN Intrusion Detection

Authors: Gabriele Digregorio, Elisabetta Cainazzo, Stefano Longari, Michele Carminati, Stefano Zanero
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.04978
Pdf URL: https://arxiv.org/pdf/2506.04978
Copy Paste: [[2506.04978]] Evaluating the Impact of Privacy-Preserving Federated Learning on CAN Intrusion Detection(https://arxiv.org/abs/2506.04978)
Keywords: privacy, federate
Abstract: The challenges derived from the data-intensive nature of machine learning in conjunction with technologies that enable novel paradigms such as V2X and the potential offered by 5G communication, allow and justify the deployment of Federated Learning (FL) solutions in the vehicular intrusion detection domain. In this paper, we investigate the effects of integrating FL strategies into the machine learning-based intrusion detection process for on-board vehicular networks. Accordingly, we propose a FL implementation of a state-of-the-art Intrusion Detection System (IDS) for Controller Area Network (CAN), based on LSTM autoencoders. We thoroughly evaluate its detection efficiency and communication overhead, comparing it to a centralized version of the same algorithm, thereby presenting it as a feasible solution.

Title: Agentic AI for Intent-Based Industrial Automation

Authors: Marcos Lima Romero, Ricardo Suyama
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.04980
Pdf URL: https://arxiv.org/pdf/2506.04980
Copy Paste: [[2506.04980]] Agentic AI for Intent-Based Industrial Automation(https://arxiv.org/abs/2506.04980)
Keywords: explainability, large language model
Abstract: The recent development of Agentic AI systems, empowered by autonomous large language models (LLMs) agents with planning and tool-usage capabilities, enables new possibilities for the evolution of industrial automation and reduces the complexity introduced by Industry 4.0. This work proposes a conceptual framework that integrates Agentic AI with the intent-based paradigm, originally developed in network research, to simplify human-machine interaction (HMI) and better align automation systems with the human-centric, sustainable, and resilient principles of Industry 5.0. Based on the intent-based processing, the framework allows human operators to express high-level business or operational goals in natural language, which are decomposed into actionable components. These intents are broken into expectations, conditions, targets, context, and information that guide sub-agents equipped with specialized tools to execute domain-specific tasks. A proof of concept was implemented using the CMAPSS dataset and Google Agent Developer Kit (ADK), demonstrating the feasibility of intent decomposition, agent orchestration, and autonomous decision-making in predictive maintenance scenarios. The results confirm the potential of this approach to reduce technical barriers and enable scalable, intent-driven automation, despite data quality and explainability concerns.

Title: TextVidBench: A Benchmark for Long Video Scene Text Understanding

Authors: Yangyang Zhong, Ji Qi, Yuan Yao, Pengxin Luo, Yunfeng Yan, Donglian Qi, Zhiyuan Liu, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04983
Pdf URL: https://arxiv.org/pdf/2506.04983
Copy Paste: [[2506.04983]] TextVidBench: A Benchmark for Long Video Scene Text Understanding(https://arxiv.org/abs/2506.04983)
Keywords: large language model
Abstract: Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.

Title: FPTQuant: Function-Preserving Transforms for LLM Quantization

Authors: Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04985
Pdf URL: https://arxiv.org/pdf/2506.04985
Copy Paste: [[2506.04985]] FPTQuant: Function-Preserving Transforms for LLM Quantization(https://arxiv.org/abs/2506.04985)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model's function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally to reduce outliers, and end-to-end such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows SOTA speed-up of up to 3.9 times over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off -- it is performing on par or exceeding most prior work and only shows slightly lower accuracy compared to a method that is up to 29% slower.

Title: Multi-scale Image Super Resolution with a Single Auto-Regressive Model

Authors: Enrique Sanchez, Isma Hadji, Adrian Bulat, Christos Tzelepis, Brais Martinez, Georgios Tzimiropoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04990
Pdf URL: https://arxiv.org/pdf/2506.04990
Copy Paste: [[2506.04990]] Multi-scale Image Super Resolution with a Single Auto-Regressive Model(https://arxiv.org/abs/2506.04990)
Keywords: transformer
Abstract: In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve \textit{state-of-the-art results on ISR}, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.

Title: PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Authors: Edoardo Bianchi, Antonio Liotta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04996
Pdf URL: https://arxiv.org/pdf/2506.04996
Copy Paste: [[2506.04996]] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment(https://arxiv.org/abs/2506.04996)
Keywords: segmentation
Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics-from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills-demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.

Title: SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

Authors: Yongjie Xiao, Hongru Liang, Peixin Qin, Yao Zhang, Wenqiang Lei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05000
Pdf URL: https://arxiv.org/pdf/2506.05000
Copy Paste: [[2506.05000]] SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View(https://arxiv.org/abs/2506.05000)
Keywords: large language model
Abstract: Despite the great potential of large language models(LLMs) in machine comprehension, it is still disturbing to fully count on them in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematical definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-sourced and closed-sourced LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable -- they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.

Title: Attack Effect Model based Malicious Behavior Detection

Authors: Limin Wang, Lei Bu, Muzimiao Zhang, Shihong Cang, Kai Ye
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05001
Pdf URL: https://arxiv.org/pdf/2506.05001
Copy Paste: [[2506.05001]] Attack Effect Model based Malicious Behavior Detection(https://arxiv.org/abs/2506.05001)
Keywords: security, attack
Abstract: Traditional security detection methods face three key challenges: inadequate data collection that misses critical security events, resource-intensive monitoring systems, and poor detection algorithms with high false positive rates. We present FEAD (Focus-Enhanced Attack Detection), a framework that addresses these issues through three innovations: (1) an attack model-driven approach that extracts security-critical monitoring items from online attack reports for comprehensive coverage; (2) efficient task decomposition that optimally distributes monitoring across existing collectors to minimize overhead; and (3) locality-aware anomaly analysis that leverages the clustering behavior of malicious activities in provenance graphs to improve detection accuracy. Evaluations demonstrate FEAD achieves 8.23% higher F1-score than existing solutions with only 5.4% overhead, confirming that focus-based designs significantly enhance detection performance.

Title: Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting

Authors: Alfred T. Christiansen, Andreas H. Højrup, Morten K. Stephansen, Md Ibtihaj A. Sakib, Taman S. Poojary, Filip Slezak, Morten S. Laursen, Thomas B. Moeslund, Joakim B. Haurum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05009
Pdf URL: https://arxiv.org/pdf/2506.05009
Copy Paste: [[2506.05009]] Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting(https://arxiv.org/abs/2506.05009)
Keywords: transformer, segmentation
Abstract: Training neural networks for tasks such as 3D point cloud semantic segmentation demands extensive datasets, yet obtaining and annotating real-world point clouds is costly and labor-intensive. This work aims to introduce a novel pipeline for generating realistic synthetic data, by leveraging 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate 3D assets of multiple different agricultural vehicles instead of using generic models. These assets are placed in a simulated environment, where the point clouds are generated using a simulated LiDAR. This is a flexible approach that allows changing the LiDAR specifications without incurring additional costs. We evaluated the impact of synthetic data on segmentation models such as PointNet++, Point Transformer V3, and OACNN, by training and validating the models only on synthetic data. Remarkably, the PTv3 model had an mIoU of 91.35\%, a noteworthy result given that the model had neither been trained nor validated on any real data. Further studies even suggested that in certain scenarios the models trained only on synthetically generated data performed better than models trained on real-world data. Finally, experiments demonstrated that the models can generalize across semantic classes, enabling accurate predictions on mesh models they were never trained on.

Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Authors: Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05010
Pdf URL: https://arxiv.org/pdf/2506.05010
Copy Paste: [[2506.05010]] ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development(https://arxiv.org/abs/2506.05010)
Keywords: large language model
Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at this https URL.

Title: Identifying and Understanding Cross-Class Features in Adversarial Training

Authors: Zeming Wei, Yiwen Guo, Yisen Wang
Subjects: cs.LG, cs.AI, cs.CR, cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2506.05032
Pdf URL: https://arxiv.org/pdf/2506.05032
Copy Paste: [[2506.05032]] Identifying and Understanding Cross-Class Features in Adversarial Training(https://arxiv.org/abs/2506.05032)
Keywords: attack, robust
Abstract: Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at this https URL.

Title: Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Authors: Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Xuetao Wei, Hailiang Huang, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05038
Pdf URL: https://arxiv.org/pdf/2506.05038
Copy Paste: [[2506.05038]] Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers(https://arxiv.org/abs/2506.05038)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face the challenges of robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate the LLM robustness with hand-crafted templates or a limited set of perturbation rules, indicating potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meanings of the original one but might fail the LLMs. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

Title: FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05046
Pdf URL: https://arxiv.org/pdf/2506.05046
Copy Paste: [[2506.05046]] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing(https://arxiv.org/abs/2506.05046)
Keywords: diffusion
Abstract: Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.

Title: TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

Authors: Moshe Ofer, Orel Zamler, Amos Azaria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05057
Pdf URL: https://arxiv.org/pdf/2506.05057
Copy Paste: [[2506.05057]] TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages(https://arxiv.org/abs/2506.05057)
Keywords: transformer, large language model
Abstract: Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM's capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.

Title: NIMO: a Nonlinear Interpretable MOdel

Authors: Shijian Xu, Marcello Massimo Negri, Volker Roth
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05059
Pdf URL: https://arxiv.org/pdf/2506.05059
Copy Paste: [[2506.05059]] NIMO: a Nonlinear Interpretable MOdel(https://arxiv.org/abs/2506.05059)
Keywords: interpretability
Abstract: Neural networks (NNs) have achieved tremendous success over the past decade, yet they are still extremely difficult to interpret. In contrast, linear models are less expressive but offer inherent interpretability. Linear coefficients are interpretable as the marginal effect of a feature on the prediction, assuming all other features are kept fixed. To combine the benefits of both approaches, we introduce NIMO (Nonlinear Interpretable MOdel). The key idea is to define a model where the NN is designed to learn nonlinear corrections to the linear model predictions, while also maintaining the original interpretability of the linear coefficients. Relevantly, we develop an optimization algorithm based on profile likelihood that elegantly allows for optimizing over the NN parameters while updating the linear coefficients analytically. By relying on adaptive ridge regression we can easily incorporate sparsity constraints as well. We show empirically that we can recover the underlying linear coefficients while significantly improving the predictive accuracy. Compared to other hybrid interpretable approaches, our model is the only one that actually maintains the same interpretability of linear coefficients as in linear models. We also achieve higher performance on various regression and classification settings.

Title: A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions

Authors: Anh Le, Thanh Lam, Dung Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05061
Pdf URL: https://arxiv.org/pdf/2506.05061
Copy Paste: [[2506.05061]] A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions(https://arxiv.org/abs/2506.05061)
Keywords: large language model
Abstract: Vietnamese document analysis and recognition (DAR) is a crucial field with applications in digitization, information retrieval, and automation. Despite advancements in OCR and NLP, Vietnamese text recognition faces unique challenges due to its complex diacritics, tonal variations, and lack of large-scale annotated datasets. Traditional OCR methods often struggle with real-world document variations, while deep learning approaches have shown promise but remain limited by data scarcity and generalization issues. Recently, large language models (LLMs) and vision-language models have demonstrated remarkable improvements in text recognition and document understanding, offering a new direction for Vietnamese DAR. However, challenges such as domain adaptation, multimodal learning, and computational efficiency persist. This survey provide a comprehensive review of existing techniques in Vietnamese document recognition, highlights key limitations, and explores how LLMs can revolutionize the field. We discuss future research directions, including dataset development, model optimization, and the integration of multimodal approaches for improved document intelligence. By addressing these gaps, we aim to foster advancements in Vietnamese DAR and encourage community-driven solutions.

Title: Does It Make Sense to Speak of Introspection in Large Language Models?

Authors: Iulia Comşa, Murray Shanahan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05068
Pdf URL: https://arxiv.org/pdf/2506.05068
Copy Paste: [[2506.05068]] Does It Make Sense to Speak of Introspection in Large Language Models?(https://arxiv.org/abs/2506.05068)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own ``creative'' writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

Title: RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Authors: Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05070
Pdf URL: https://arxiv.org/pdf/2506.05070
Copy Paste: [[2506.05070]] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation(https://arxiv.org/abs/2506.05070)
Keywords: large language model
Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates the both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation for closing this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

Title: Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation

Authors: Soumitra Ghosh, Gopendra Vikram Singh, Shambhavi, Sabarna Choudhury, Asif Ekbal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05073
Pdf URL: https://arxiv.org/pdf/2506.05073
Copy Paste: [[2506.05073]] Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation(https://arxiv.org/abs/2506.05073)
Keywords: extraction, large language model
Abstract: Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs' comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs-Llama 3, Mental-Alpaca, and MentalLlama, across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: this https URL .

Title: Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin

Authors: HaoTian Lan
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05080
Pdf URL: https://arxiv.org/pdf/2506.05080
Copy Paste: [[2506.05080]] Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin(https://arxiv.org/abs/2506.05080)
Keywords: large language model
Abstract: The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features -- including parked vehicle density, greenery, cleanliness, and street width -- impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking -- especially in narrow streets -- erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.

Title: SeedEdit 3.0: Fast and High-Quality Generative Image Editing

Authors: Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05083
Pdf URL: https://arxiv.org/pdf/2506.05083
Copy Paste: [[2506.05083]] SeedEdit 3.0: Fast and High-Quality Generative Image Editing(https://arxiv.org/abs/2506.05083)
Keywords: diffusion, generative
Abstract: We introduce SeedEdit 3.0, in companion with our T2I model Seedream 3.0 [22], which significantly improves over our previous version [27] in both aspects of edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. Additional to model upgrading with T2I, in this report, we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and meta information is helpfult to connect VLM with diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and a reward loss. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks, for real image editing, where it achieves a best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

Title: Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

Authors: HaoTian Lan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05087
Pdf URL: https://arxiv.org/pdf/2506.05087
Copy Paste: [[2506.05087]] Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics(https://arxiv.org/abs/2506.05087)
Keywords: transformer, large language model
Abstract: While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns -- such as the divergent perceptual effects of architectural transparency across residential and commercial zones -- revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.

Title: FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)

Authors: Jiaee Cheong, Yang Liu, Harold Soh, Hatice Gunes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05095
Pdf URL: https://arxiv.org/pdf/2506.05095
Copy Paste: [[2506.05095]] FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)(https://arxiv.org/abs/2506.05095)
Keywords: privacy, fair, interpretability, explainability
Abstract: With the increasing prevalence and deployment of Emotion AI-powered facial affect analysis (FAA) tools, concerns about the trustworthiness of these systems have become more prominent. This first workshop on "Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)" aims to bring together researchers who are investigating different challenges in relation to trustworthiness-such as interpretability, uncertainty, biases, and privacy-across various facial affect analysis tasks, including macro/ micro-expression recognition, facial action unit detection, other corresponding applications such as pain and depression detection, as well as human-robot interaction and collaboration. In alignment with FG2025's emphasis on ethics, as demonstrated by the inclusion of an Ethical Impact Statement requirement for this year's submissions, this workshop supports FG2025's efforts by encouraging research, discussion and dialogue on trustworthy FAA.

Title: Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Authors: Haosong Liu, Yuge Cheng, Zihan Liu, Aiyue Chen, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, Minyi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05096
Pdf URL: https://arxiv.org/pdf/2506.05096
Copy Paste: [[2506.05096]] Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers(https://arxiv.org/abs/2506.05096)
Keywords: diffusion, transformer
Abstract: Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).

Title: Privacy Amplification Through Synthetic Data: Insights from Linear Regression

Authors: Clément Pierquin, Aurélien Bellet, Marc Tommasi, Matthieu Boussard
Subjects: cs.LG, cs.CR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05101
Pdf URL: https://arxiv.org/pdf/2506.05101
Copy Paste: [[2506.05101]] Privacy Amplification Through Synthetic Data: Insights from Linear Regression(https://arxiv.org/abs/2506.05101)
Keywords: privacy, generative
Abstract: Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. While empirical studies suggest this phenomenon, a rigorous theoretical understanding is still lacking. In this paper, we investigate this question through the well-understood framework of linear regression. First, we establish negative results showing that if an adversary controls the seed of the generative model, a single synthetic data point can leak as much information as releasing the model itself. Conversely, we show that when synthetic data is generated from random inputs, releasing a limited number of synthetic data points amplifies privacy beyond the model's inherent guarantees. We believe our findings in linear regression can serve as a foundation for deriving more general bounds in the future.

Title: DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Authors: Revant Teotia, Candace Ross, Karen Ullrich, Sumit Chopra, Adriana Romero-Soriano, Melissa Hall, Matthew J. Muckley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05108
Pdf URL: https://arxiv.org/pdf/2506.05108
Copy Paste: [[2506.05108]] DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models(https://arxiv.org/abs/2506.05108)
Keywords: interpretability, generative, large language model
Abstract: Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

Title: Practical Manipulation Model for Robust Deepfake Detection

Authors: Benedikt Hopf, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05119
Pdf URL: https://arxiv.org/pdf/2506.05119
Copy Paste: [[2506.05119]] Practical Manipulation Model for Robust Deepfake Detection(https://arxiv.org/abs/2506.05119)
Keywords: robust
Abstract: Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of $3.51\%$ and $6.21\%$ AUC on the DFDC and DFDCP datasets, respectively, over the s-o-t-a LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at this https URL

Title: The NTNU System at the S&I Challenge 2025 SLA Open Track

Authors: Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.05121
Pdf URL: https://arxiv.org/pdf/2506.05121
Copy Paste: [[2506.05121]] The NTNU System at the S&I Challenge 2025 SLA Open Track(https://arxiv.org/abs/2506.05121)
Keywords: interpretability, large language model
Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.

Title: Membership Inference Attacks on Sequence Models

Authors: Lorenzo Rossi, Michael Aerni, Jie Zhang, Florian Tramèr
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05126
Pdf URL: https://arxiv.org/pdf/2506.05126
Copy Paste: [[2506.05126]] Membership Inference Attacks on Sequence Models(https://arxiv.org/abs/2506.05126)
Keywords: privacy, attack, membership infer, large language model
Abstract: Sequence models, such as Large Language Models (LLMs) and autoregressive image generators, have a tendency to memorize and inadvertently leak sensitive information. While this tendency has critical legal implications, existing tools are insufficient to audit the resulting risks. We hypothesize that those tools' shortcomings are due to mismatched assumptions. Thus, we argue that effectively measuring privacy leakage in sequence models requires leveraging the correlations inherent in sequential generation. To illustrate this, we adapt a state-of-the-art membership inference attack to explicitly model within-sequence correlations, thereby demonstrating how a strong existing attack can be naturally extended to suit the structure of sequence models. Through a case study, we show that our adaptations consistently improve the effectiveness of memorization audits without introducing additional computational costs. Our work hence serves as an important stepping stone toward reliable memorization audits for large sequence models.

Title: DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Authors: Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05128
Pdf URL: https://arxiv.org/pdf/2506.05128
Copy Paste: [[2506.05128]] DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning(https://arxiv.org/abs/2506.05128)
Keywords: large language model
Abstract: Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline -- establishing DiCoRe as a strong zero-shot ED framework.

Title: Information Locality as an Inductive Bias for Neural Language Models

Authors: Taiga Someya, Anej Svete, Brian DuSell, Timothy J. O'Donnell, Mario Giulianelli, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05136
Pdf URL: https://arxiv.org/pdf/2506.05136
Copy Paste: [[2506.05136]] Information Locality as an Inductive Bias for Neural Language Models(https://arxiv.org/abs/2506.05136)
Keywords: transformer
Abstract: Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy$\unicode{x2013}$an information-theoretic measure derived from average lossy-context surprisal$\unicode{x2013}$that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.

Title: Federated Isolation Forest for Efficient Anomaly Detection on Edge IoT Systems

Authors: Pavle Vasiljevic, Milica Matic, Miroslav Popovic
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2506.05138
Pdf URL: https://arxiv.org/pdf/2506.05138
Copy Paste: [[2506.05138]] Federated Isolation Forest for Efficient Anomaly Detection on Edge IoT Systems(https://arxiv.org/abs/2506.05138)
Keywords: privacy, federate
Abstract: Recently, federated learning frameworks such as Python TestBed for Federated Learning Algorithms and MicroPython TestBed for Federated Learning Algorithms have emerged to tackle user privacy concerns and efficiency in embedded systems. Even more recently, an efficient federated anomaly detection algorithm, FLiForest, based on Isolation Forests has been developed, offering a low-resource, unsupervised method well-suited for edge deployment and continuous learning. In this paper, we present an application of Isolation Forest-based temperature anomaly detection, developed using the previously mentioned federated learning frameworks, aimed at small edge devices and IoT systems running MicroPython. The system has been experimentally evaluated, achieving over 96% accuracy in distinguishing normal from abnormal readings and above 78% precision in detecting anomalies across all tested configurations, while maintaining a memory usage below 160 KB during model training. These results highlight its suitability for resource-constrained environments and edge systems, while upholding federated learning principles of data privacy and collaborative learning.

Title: Do Large Language Models Judge Error Severity Like Humans?

Authors: Diege Sun, Guanyi Chen, Fan Zhao, Xiaorong Cheng, Tingting He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05142
Pdf URL: https://arxiv.org/pdf/2506.05142
Copy Paste: [[2506.05142]] Do Large Language Models Judge Error Severity Like Humans?(https://arxiv.org/abs/2506.05142)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.

Title: Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

Authors: Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.05154
Pdf URL: https://arxiv.org/pdf/2506.05154
Copy Paste: [[2506.05154]] Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation(https://arxiv.org/abs/2506.05154)
Keywords: robust, large language model
Abstract: Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However,current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1 that using joint sampling and define multi policy distributions in knowledge capability exploration to stimulate large language models'self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parameters and contextual conflict tasks and general RAG tasks, especially outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code are available at this https URL knowledgeable-r1.

Title: Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Authors: Bhavik Chandna, Zubair Bashir, Procheta Sen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05166
Pdf URL: https://arxiv.org/pdf/2506.05166
Copy Paste: [[2506.05166]] Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective(https://arxiv.org/abs/2506.05166)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment because of the sharing of important components with these tasks.

Title: ECoRAG: Evidentiality-guided Compression for Long Context RAG

Authors: Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.05167
Pdf URL: https://arxiv.org/pdf/2506.05167
Copy Paste: [[2506.05167]] ECoRAG: Evidentiality-guided Compression for Long Context RAG(https://arxiv.org/abs/2506.05167)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce RAG overhead, from longer context, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limit the performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or \textbf{ECoRAG} framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring whether answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects whether the compressed content provides sufficient evidence, and if not, retrieves more until sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at this https URL.

Title: Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks

Authors: Weicheng Gao
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2506.05169
Pdf URL: https://arxiv.org/pdf/2506.05169
Copy Paste: [[2506.05169]] Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks(https://arxiv.org/abs/2506.05169)
Keywords: interpretability, segmentation
Abstract: After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training on radar image data through neural network models. The earliest related works in this field based on template matching did not require a training process, and I believe they have never died. Because these methods possess a strong physical interpretability and are closer to the basis of theoretical signal processing research. In this paper, I would like to try to return to the original path by attempting to eschew neural networks to achieve the TWR HAR task and challenge to achieve intelligent recognition as neural network models. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using corner detection method, and the micro-Doppler signature is segmented using the multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerical simulated and measured experiments. The open-source code of this work is released at: this https URL.

Title: Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

Authors: Yuzhi Huang, Chenxin Li, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05175
Pdf URL: https://arxiv.org/pdf/2506.05175
Copy Paste: [[2506.05175]] Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline(https://arxiv.org/abs/2506.05175)
Keywords: robust, segmentation
Abstract: Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.

Title: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05176
Pdf URL: https://arxiv.org/pdf/2506.05176
Copy Paste: [[2506.05176]] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models(https://arxiv.org/abs/2506.05176)
Keywords: robust
Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

Title: Associative Memory and Generative Diffusion in the Zero-noise Limit

Authors: Joshua Hess, Quaid Morris
Subjects: cs.LG, cond-mat.dis-nn, math.DS, nlin.AO, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.05178
Pdf URL: https://arxiv.org/pdf/2506.05178
Copy Paste: [[2506.05178]] Associative Memory and Generative Diffusion in the Zero-noise Limit(https://arxiv.org/abs/2506.05178)
Keywords: diffusion, generative
Abstract: Connections between generative diffusion and continuous-state associative memory models are studied. Morse-Smale dynamical systems are emphasized as universal approximators of gradient-based associative memory models and diffusion models as white-noise perturbed systems thereof. Universal properties of associative memory that follow from this description are described and used to characterize a generic transition from generation to memory as noise levels diminish. Structural stability inherited by Morse-Smale flows is shown to imply a notion of stability for diffusions at vanishing noise levels. Applied to one- and two-parameter families of gradients, this indicates stability at all but isolated points of associative memory learning landscapes and the learning and generation landscapes of diffusion models with gradient drift in the zero-noise limit, at which small sets of generic bifurcations characterize qualitative transitions between stable systems. Examples illustrating the characterization of these landscapes by sequences of these bifurcations are given, along with structural stability criterion for classic and modern Hopfield networks (equivalently, the attention mechanism).

Title: TreeRPO: Tree Relative Policy Optimization

Authors: Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05183
Pdf URL: https://arxiv.org/pdf/2506.05183
Copy Paste: [[2506.05183]] TreeRPO: Tree Relative Policy Optimization(https://arxiv.org/abs/2506.05183)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce \textbf{\name}, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, \name directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, \name innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows \name to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our \name algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0\% to 35.5\%. Furthermore, \name significantly outperforms GRPO by 2.9\% in performance while simultaneously reducing the average response length by 18.1\%, showcasing its effectiveness and efficiency. Our code will be available at \href{this https URL}{this https URL}.

Title: Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis

Authors: Neeraj Kumar, Swaraj Nanda, Siddharth Singi, Jamal Benhamida, David Kim, Jie-Fu Chen, Amir Momeni-Boroujeni, Gregory M. Goldgof, Gabriele Campanella, Chad Vanderbilt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05184
Pdf URL: https://arxiv.org/pdf/2506.05184
Copy Paste: [[2506.05184]] Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis(https://arxiv.org/abs/2506.05184)
Keywords: transformer
Abstract: Pathology foundation models (PFMs) have emerged as powerful tools for analyzing whole slide images (WSIs). However, adapting these pretrained PFMs for specific clinical tasks presents considerable challenges, primarily due to the availability of only weak (WSI-level) labels for gigapixel images, necessitating multiple instance learning (MIL) paradigm for effective WSI analysis. This paper proposes a novel approach for single-GPU \textbf{T}ask \textbf{A}daptation of \textbf{PFM}s (TAPFM) that uses vision transformer (\vit) attention for MIL aggregation while optimizing both for feature representations and attention weights. The proposed approach maintains separate computational graphs for MIL aggregator and the PFM to create stable training dynamics that align with downstream task objectives during end-to-end adaptation. Evaluated on mutation prediction tasks for bladder cancer and lung adenocarcinoma across institutional and TCGA cohorts, TAPFM consistently outperforms conventional approaches, with H-Optimus-0 (TAPFM) outperforming the benchmarks. TAPFM effectively handles multi-label classification of actionable mutations as well. Thus, TAPFM makes adaptation of powerful pre-trained PFMs practical on standard hardware for various clinical applications.

Title: Counterfactual reasoning: an analysis of in-context emergence

Authors: Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Subjects: cs.CL, cs.AI, cs.LG, math.ST
Abstract URL: https://arxiv.org/abs/2506.05188
Pdf URL: https://arxiv.org/pdf/2506.05188
Copy Paste: [[2506.05188]] Counterfactual reasoning: an analysis of in-context emergence(https://arxiv.org/abs/2506.05188)
Keywords: transformer
Abstract: Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn and reason the input context on the fly without parameter update. This work studies in-context counterfactual reasoning in language models, that is, to predict the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under this https URL .

Title: Locality Preserving Markovian Transition for Instance Retrieval

Authors: Jifei Luo, Wenzheng Wu, Hantao Yao, Lu Yu, Changsheng Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05196
Pdf URL: https://arxiv.org/pdf/2506.05196
Copy Paste: [[2506.05196]] Locality Preserving Markovian Transition for Instance Retrieval(https://arxiv.org/abs/2506.05196)
Keywords: diffusion
Abstract: Diffusion-based re-ranking methods are effective in modeling the data manifolds through similarity propagation in affinity graphs. However, positive signals tend to diminish over several steps away from the source, reducing discriminative power beyond local regions. To address this issue, we introduce the Locality Preserving Markovian Transition (LPMT) framework, which employs a long-term thermodynamic transition process with multiple states for accurate manifold distance measurement. The proposed LPMT first integrates diffusion processes across separate graphs using Bidirectional Collaborative Diffusion (BCD) to establish strong similarity relationships. Afterwards, Locality State Embedding (LSE) encodes each instance into a distribution for enhanced local consistency. These distributions are interconnected via the Thermodynamic Markovian Transition (TMT) process, enabling efficient global retrieval while maintaining local effectiveness. Experimental results across diverse tasks confirm the effectiveness of LPMT for instance retrieval.

Title: Quantifying Cross-Modality Memorization in Vision-Language Models

Authors: Yuxin Wen, Yangsibo Huang, Tom Goldstein, Ravi Kumar, Badih Ghazi, Chiyuan Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05198
Pdf URL: https://arxiv.org/pdf/2506.05198
Copy Paste: [[2506.05198]] Quantifying Cross-Modality Memorization in Vision-Language Models(https://arxiv.org/abs/2506.05198)
Keywords: robust, diffusion, large language model
Abstract: Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. At the end, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.

Title: Transformers Meet In-Context Learning: A Universal Approximation Theory

Authors: Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05200
Pdf URL: https://arxiv.org/pdf/2506.05200
Copy Paste: [[2506.05200]] Transformers Meet In-Context Learning: A Universal Approximation Theory(https://arxiv.org/abs/2506.05200)
Keywords: transformer, large language model
Abstract: Modern large language models are capable of in-context learning, the ability to perform new tasks at inference time using only a handful of input-output examples in the prompt, without any fine-tuning or parameter updates. We develop a universal approximation theory to better understand how transformers enable in-context learning. For any class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can perform reliable prediction given only a few in-context examples. In contrast to much of the recent literature that frames transformers as algorithm approximators -- i.e., constructing transformers to emulate the iterations of optimization algorithms as a means to approximate solutions of learning problems -- our work adopts a fundamentally different approach rooted in universal function approximation. This alternative approach offers approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being approximated, thereby extending far beyond convex problems and linear function classes. Our construction sheds light on how transformers can simultaneously learn general-purpose representations and adapt dynamically to in-context examples.

Title: OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Authors: Yanbo Wang, Ziyi Wang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05204
Pdf URL: https://arxiv.org/pdf/2506.05204
Copy Paste: [[2506.05204]] OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View(https://arxiv.org/abs/2506.05204)
Keywords: diffusion, generative
Abstract: Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

Title: RELIC: Evaluating Compositional Instruction Following via Language Recognition

Authors: Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05205
Pdf URL: https://arxiv.org/pdf/2506.05205
Copy Paste: [[2506.05205]] RELIC: Evaluating Compositional Instruction Following via Language Recognition(https://arxiv.org/abs/2506.05205)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.

Title: Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

Authors: Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05207
Pdf URL: https://arxiv.org/pdf/2506.05207
Copy Paste: [[2506.05207]] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning(https://arxiv.org/abs/2506.05207)
Keywords: diffusion, transformer
Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generations. As for the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptations (LoRAs) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, they require time-consuming fine-tuning processes in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex this http URL, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.

Title: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05209
Pdf URL: https://arxiv.org/pdf/2506.05209
Copy Paste: [[2506.05209]] The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text(https://arxiv.org/abs/2506.05209)
Keywords: large language model
Abstract: Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

Title: Learning Theory of Decentralized Robust Kernel-Based Learning Algorithm

Authors: Zhan Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05215
Pdf URL: https://arxiv.org/pdf/2506.05215
Copy Paste: [[2506.05215]] Learning Theory of Decentralized Robust Kernel-Based Learning Algorithm(https://arxiv.org/abs/2506.05215)
Keywords: robust
Abstract: We propose a new decentralized robust kernel-based learning algorithm within the framework of reproducing kernel Hilbert space (RKHS) by utilizing a networked system that can be represented as a connected graph. The robust loss function $\mathcal{L}_\sigma$ induced by a windowing function $W$ and a robustness scaling parameter $\sigma>0$, can encompass a broad spectrum of robust losses. Consequently, the proposed algorithm effectively provides a unified decentralized learning framework for robust regression, which fundamentally differs from the existing distributed robust kernel learning schemes, all of which are divide-and-conquer based. We rigorously establish the learning theory and offer a comprehensive convergence analysis for the algorithm. We show each local robust estimator generated from the decentralized algorithm can be utilized to approximate the regression function. Based on kernel-based integral operator techniques, we derive general high confidence convergence bounds for each local approximating sequence in terms of the mean square distance, RKHS norm, and generalization error, respectively. Moreover, we provide rigorous selection rules for local sample size and show that, under properly selected step size and scaling parameter $\sigma$, the decentralized robust algorithm can achieve optimal learning rates (up to logarithmic factors) in both norms. The parameter $\sigma$ is shown to be essential for enhancing robustness while also ensuring favorable convergence behavior. The intrinsic connection among decentralization, sample selection, robustness of the algorithm, and its convergence is clearly reflected.

Title: DSG-World: Learning a 3D Gaussian World Model from Dual State Videos

Authors: Wenhao Hu, Xuexiang Wen, Xi Li, Gaoang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05217
Pdf URL: https://arxiv.org/pdf/2506.05217
Copy Paste: [[2506.05217]] DSG-World: Learning a 3D Gaussian World Model from Dual State Videos(https://arxiv.org/abs/2506.05217)
Keywords: generative, segmentation
Abstract: Building an efficient and physically consistent world model from limited observations is a long standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi-stage processing-such as segmentation, background completion, and inpainting-due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co-pruning trategies to refine geometric completeness. DSG-World enables efficient real-to-simulation transfer purely in the explicit Gaussian representation space, supporting high-fidelity rendering and object-level scene manipulation without relying on dense observations or multi-stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real-world 3D reconstruction and simulation.

Title: SAM-aware Test-time Adaptation for Universal Medical Image Segmentation

Authors: Jianghao Wu, Yicheng Wu, Yutong Xie, Wenjia Bai, You Zhang, Feilong Tang, Yulong Li, Yasmeen George, Imran Razzak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05221
Pdf URL: https://arxiv.org/pdf/2506.05221
Copy Paste: [[2506.05221]] SAM-aware Test-time Adaptation for Universal Medical Image Segmentation(https://arxiv.org/abs/2506.05221)
Keywords: segmentation
Abstract: Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM's performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM's internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at this https URL.

Title: Improving Low-Resource Morphological Inflection via Self-Supervised Objectives

Authors: Adam Wiemerslage, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05227
Pdf URL: https://arxiv.org/pdf/2506.05227
Copy Paste: [[2506.05227]] Improving Low-Resource Morphological Inflection via Self-Supervised Objectives(https://arxiv.org/abs/2506.05227)
Keywords: transformer
Abstract: Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world's languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection -- a character-level task highly relevant for language documentation -- in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.

Title: Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05229
Pdf URL: https://arxiv.org/pdf/2506.05229
Copy Paste: [[2506.05229]] Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts(https://arxiv.org/abs/2506.05229)
Keywords: transformer
Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.

Title: Progressive Tempering Sampler with Diffusion

Authors: Severi Rissanen, RuiKang OuYang, Jiajun He, Wenlin Chen, Markus Heinonen, Arno Solin, José Miguel Hernández-Lobato
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05231
Pdf URL: https://arxiv.org/pdf/2506.05231
Copy Paste: [[2506.05231]] Progressive Tempering Sampler with Diffusion(https://arxiv.org/abs/2506.05231)
Keywords: diffusion
Abstract: Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), when it comes to the efficiency of target evaluations. On the other hand, unlike a well-trained neural sampler, PT yields only dependent samples and needs to be rerun -- at considerable computational cost -- whenever new samples are required. To address these weaknesses, we propose the Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures, leveraging the advantages of PT to improve the training of neural samplers. We also introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. PTSD enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers.

Title: MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Authors: Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05233
Pdf URL: https://arxiv.org/pdf/2506.05233
Copy Paste: [[2506.05233]] MesaNet: Sequence Modeling by Locally Optimal Test-Time Training(https://arxiv.org/abs/2506.05233)
Keywords: transformer
Abstract: Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.

Title: Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Authors: Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05239
Pdf URL: https://arxiv.org/pdf/2506.05239
Copy Paste: [[2506.05239]] Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit(https://arxiv.org/abs/2506.05239)
Keywords: extraction, interpretability
Abstract: Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we introduce a multi-iteration SAE by unrolling Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.

Title: Aligning Latent Spaces with Flow Priors

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05240
Pdf URL: https://arxiv.org/pdf/2506.05240
Copy Paste: [[2506.05240]] Aligning Latent Spaces with Flow Priors(https://arxiv.org/abs/2506.05240)
Keywords: generative
Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.

Title: SECNEURON: Reliable and Flexible Abuse Control in Local LLMs via Hybrid Neuron Encryption

Authors: Zhiqiang Wang, Haohua Du, Junyang Wang, Haifeng Sun, Kaiwen Guo, Haikuo Yu, Chao Liu, Xiang-Yang Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05242
Pdf URL: https://arxiv.org/pdf/2506.05242
Copy Paste: [[2506.05242]] SECNEURON: Reliable and Flexible Abuse Control in Local LLMs via Hybrid Neuron Encryption(https://arxiv.org/abs/2506.05242)
Keywords: security, extraction, membership infer, large language model
Abstract: Large language models (LLMs) with diverse capabilities are increasingly being deployed in local environments, presenting significant security and controllability challenges. These locally deployed LLMs operate outside the direct control of developers, rendering them more susceptible to abuse. Existing mitigation techniques mainly designed for cloud-based LLM services are frequently circumvented or ineffective in deployer-controlled environments. We propose SECNEURON, the first framework that seamlessly embeds classic access control within the intrinsic capabilities of LLMs, achieving reliable, cost-effective, flexible, and certified abuse control for local deployed LLMs. SECNEURON employs neuron-level encryption and selective decryption to dynamically control the task-specific capabilities of LLMs, limiting unauthorized task abuse without compromising others. We first design a task-specific neuron extraction mechanism to decouple logically related neurons and construct a layered policy tree for handling coupled neurons. We then introduce a flexible and efficient hybrid encryption framework for millions of neurons in LLMs. Finally, we developed a distribution-based decrypted neuron detection mechanism on ciphertext to ensure the effectiveness of partially decrypted LLMs. We proved that SECNEURON satisfies IND-CPA Security and Collusion Resistance Security under the Task Controllability Principle. Experiments on various task settings show that SECNEURON limits unauthorized task accuracy to below 25% while keeping authorized accuracy loss with 2%. Using an unauthorized Code task example, the accuracy of abuse-related malicious code generation was reduced from 59% to 15%. SECNEURON also mitigates unauthorized data leakage, reducing PII extraction rates to below 5% and membership inference to random guesses.

Title: On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2506.05249
Pdf URL: https://arxiv.org/pdf/2506.05249
Copy Paste: [[2506.05249]] On the Convergence of Gradient Descent on Learning Transformers with Residual Connections(https://arxiv.org/abs/2506.05249)
Keywords: transformer
Abstract: Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.

Title: Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

Authors: Zhiyun Deng, Dongmyeong Lee, Amanda Adkins, Jesse Quattrociocchi, Christian Ellis, Joydeep Biswas
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05250
Pdf URL: https://arxiv.org/pdf/2506.05250
Copy Paste: [[2506.05250]] Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains(https://arxiv.org/abs/2506.05250)
Keywords: robust
Abstract: Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.

Title: Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning

Authors: Dravyansh Sharma, Alec Sun
Subjects: cs.LG, cs.GT, cs.MA
Abstract URL: https://arxiv.org/abs/2506.05252
Pdf URL: https://arxiv.org/pdf/2506.05252
Copy Paste: [[2506.05252]] Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning(https://arxiv.org/abs/2506.05252)
Keywords: generative
Abstract: Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to take into account how classified agents will react to the learning algorithms. The majority of recent literature on strategic classification has focused on reducing and countering deceptive behaviors by the classified agents, but recent work of Attias et al. identifies surprising properties of learnability when the agents genuinely improve in order to attain the desirable classification, such as smaller generalization error than standard PAC-learning. In this paper we characterize so-called learnability with improvements across multiple new axes. We introduce an asymmetric variant of minimally consistent concept classes and use it to provide an exact characterization of proper learning with improvements in the realizable setting. While prior work studies learnability only under general, arbitrary agent improvement regions, we give positive results for more natural Euclidean ball improvement sets. In particular, we characterize improper learning under a mild generative assumption on the data distribution. We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning. We resolve open questions posed by Attias et al. for both proper and improper learning.

Title: LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs

Authors: Xiaodong Wang, Jinfa Huang, Li Yuan, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05260
Pdf URL: https://arxiv.org/pdf/2506.05260
Copy Paste: [[2506.05260]] LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs(https://arxiv.org/abs/2506.05260)
Keywords: large language model
Abstract: Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO~\citep{rafailov2024dpo}, to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta (y_w\mid x)$ and $\log \pi_\theta (y_l\mid x) $ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose \emph{Lean Preference Optimization} (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward the reliable and efficient Video-LLMs.

Title: Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?

Authors: Juan E. Tapia, Christoph Busch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05263
Pdf URL: https://arxiv.org/pdf/2506.05263
Copy Paste: [[2506.05263]] Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?(https://arxiv.org/abs/2506.05263)
Keywords: privacy, protect, attack
Abstract: Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities for a diversity of countries that are issuing ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested in an unknown new ID card country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FM and how to use them to adapt the generalisation on PAD of ID Documents. Different test protocols were used, considering zero-shot and fine-tuning and two different ID card datasets. One private dataset based on Chilean IDs and one open-set based on three ID countries: Finland, Spain, and Slovakia. Our findings indicate that bona fide images are the key to generalisation.

Title: Tight analyses of first-order methods with error feedback

Authors: Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut
Subjects: cs.LG, cs.DC, math.OC
Abstract URL: https://arxiv.org/abs/2506.05271
Pdf URL: https://arxiv.org/pdf/2506.05271
Copy Paste: [[2506.05271]] Tight analyses of first-order methods with error feedback(https://arxiv.org/abs/2506.05271)
Keywords: fair
Abstract: Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in a simplified yet representative setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.

Title: How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

Authors: Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05276
Pdf URL: https://arxiv.org/pdf/2506.05276
Copy Paste: [[2506.05276]] How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control(https://arxiv.org/abs/2506.05276)
Keywords: diffusion, generative
Abstract: Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE) - making precise modifications while preserving temporal coherence - consider both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints. This framework combines two key mechanisms: a confidence-weighted anchor control for point-wise constraints and a classifier-based control for managing statistical properties such as sums and averages over segments. Our methods achieve precise local control during the denoising inference stage while maintaining temporal coherence and integrating seamlessly, with any conditionally trained diffusion-based time series models. Extensive experiments across diverse datasets and models demonstrate its effectiveness. Our work bridges the gap between pure generative modeling and real-world time series editing needs, offering a flexible solution for human-in-the-loop time series generation and editing. The code and demo are provided for validation.

Title: Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Authors: Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05278
Pdf URL: https://arxiv.org/pdf/2506.05278
Copy Paste: [[2506.05278]] Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning(https://arxiv.org/abs/2506.05278)
Keywords: robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). It adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.

Title: Fast-DataShapley: Neural Modeling for Training Data Valuation

Authors: Haifeng Sun, Yu Xiong, Runze Wu, Xinyu Cai, Changjie Fan, Lan Zhang, Xiang-Yang Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05281
Pdf URL: https://arxiv.org/pdf/2506.05281
Copy Paste: [[2506.05281]] Fast-DataShapley: Neural Modeling for Training Data Valuation(https://arxiv.org/abs/2506.05281)
Keywords: protect, fair
Abstract: The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers' legitimate rights and fairly reward them for their contributions. Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent works based on Shapley values attempt to mitigate computation complexity by approximation algorithms. However, they need to retrain for each test sample, leading to intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two aspects: the approximate calculation of the utility function and the group calculation of the training data. We analyze time complexity to show the efficiency of our methods. The experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, the performance is improved to more than 2.5 times, and the explainer's training speed can be increased by two orders of magnitude.

Title: Rectified Point Flow: Generic Point Cloud Pose Estimation

Authors: Tao Sun, Liyuan Zhu, Shengyu Huang, Shuran Song, Iro Armeni
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05282
Pdf URL: https://arxiv.org/pdf/2506.05282
Copy Paste: [[2506.05282]] Rectified Point Flow: Generic Point Cloud Pose Estimation(https://arxiv.org/abs/2506.05282)
Keywords: generative
Abstract: We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: this https URL.

Title: RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion

Authors: Bardienus P. Duisterhof, Jan Oberst, Bowen Wen, Stan Birchfield, Deva Ramanan, Jeffrey Ichnowski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05285
Pdf URL: https://arxiv.org/pdf/2506.05285
Copy Paste: [[2506.05285]] RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion(https://arxiv.org/abs/2506.05285)
Keywords: transformer
Abstract: 3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D chamfer distance. Project page: this https URL

Title: Stable Vision Concept Transformers for Medical Diagnosis

Authors: Lijie Hu, Songning Lai, Yuan Hua, Shu Yang, Jingfeng Zhang, Di Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05286
Pdf URL: https://arxiv.org/pdf/2506.05286
Copy Paste: [[2506.05286]] Stable Vision Concept Transformers for Medical Diagnosis(https://arxiv.org/abs/2506.05286)
Keywords: diffusion, transformer
Abstract: Transparency is a paramount concern in the medical field, prompting researchers to delve into the realm of explainable AI (XAI). Among these XAI methods, Concept Bottleneck Models (CBMs) aim to restrict the model's latent space to human-understandable high-level concepts by generating a conceptual layer for extracting conceptual features, which has drawn much attention recently. However, existing methods rely solely on concept features to determine the model's predictions, which overlook the intrinsic feature embeddings within medical images. To address this utility gap between the original models and concept-based models, we propose Vision Concept Transformer (VCT). Furthermore, despite their benefits, CBMs have been found to negatively impact model performance and fail to provide stable explanations when faced with input perturbations, which limits their application in the medical field. To address this faithfulness issue, this paper further proposes the Stable Vision Concept Transformer (SVCT) based on VCT, which leverages the vision transformer (ViT) as its backbone and incorporates a conceptual layer. SVCT employs conceptual features to enhance decision-making capabilities by fusing them with image features and ensures model faithfulness through the integration of Denoised Diffusion Smoothing. Comprehensive experiments on four medical datasets demonstrate that our VCT and SVCT maintain accuracy while remaining interpretable compared to baselines. Furthermore, even when subjected to perturbations, our SVCT model consistently provides faithful explanations, thus meeting the needs of the medical field.

Title: EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Authors: Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05287
Pdf URL: https://arxiv.org/pdf/2506.05287
Copy Paste: [[2506.05287]] EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?(https://arxiv.org/abs/2506.05287)
Keywords: robust, large language model
Abstract: The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing object's appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specially, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

Title: AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Authors: Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05289
Pdf URL: https://arxiv.org/pdf/2506.05289
Copy Paste: [[2506.05289]] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model(https://arxiv.org/abs/2506.05289)
Keywords: diffusion
Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders the effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at this https URL.

Title: Big Bird: Privacy Budget Management for W3C's Privacy-Preserving Attribution API

Authors: Pierre Tholoniat, Alison Caulfield, Giorgio Cavicchioli, Mark Chen, Nikos Goutzoulias, Benjamin Case, Asaf Cidon, Roxana Geambasu, Mathias Lécuyer, Martin Thomson
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05290
Pdf URL: https://arxiv.org/pdf/2506.05290
Copy Paste: [[2506.05290]] Big Bird: Privacy Budget Management for W3C's Privacy-Preserving Attribution API(https://arxiv.org/abs/2506.05290)
Keywords: privacy, protect, robust
Abstract: Privacy-preserving advertising APIs like Privacy-Preserving Attribution (PPA) are designed to enhance web privacy while enabling effective ad measurement. PPA offers an alternative to cross-site tracking with encrypted reports governed by differential privacy (DP), but current designs lack a principled approach to privacy budget management, creating uncertainty around critical design decisions. We present Big Bird, a privacy budget manager for PPA that clarifies per-site budget semantics and introduces a global budgeting system grounded in resource isolation principles. Big Bird enforces utility-preserving limits via quota budgets and improves global budget utilization through a novel batched scheduling algorithm. Together, these mechanisms establish a robust foundation for enforcing privacy protections in adversarial environments. We implement Big Bird in Firefox and evaluate it on real-world ad data, demonstrating its resilience and effectiveness.

Title: A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search

Authors: Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, Gokul Swamy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05294
Pdf URL: https://arxiv.org/pdf/2506.05294
Copy Paste: [[2506.05294]] A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search(https://arxiv.org/abs/2506.05294)
Keywords: robust, diffusion
Abstract: The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish -- giving them dense supervision across a narrow set of states -- rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking. Our code is available at this https URL .

Title: Sample Complexity and Representation Ability of Test-time Scaling Paradigms

Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05295
Pdf URL: https://arxiv.org/pdf/2506.05295
Copy Paste: [[2506.05295]] Sample Complexity and Representation Ability of Test-time Scaling Paradigms(https://arxiv.org/abs/2506.05295)
Keywords: transformer, large language model
Abstract: Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.

Title: Power Law Guided Dynamic Sifting for Efficient Attention

Authors: Nirav Koley, Prajwal Singhania, Abhinav Bhatele
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05300
Pdf URL: https://arxiv.org/pdf/2506.05300
Copy Paste: [[2506.05300]] Power Law Guided Dynamic Sifting for Efficient Attention(https://arxiv.org/abs/2506.05300)
Keywords: large language model
Abstract: Efficient inference on GPUs using large language models remains challenging due to memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations. Approximate attention methods address this issue by reducing computational and memory overhead but often rely on expensive top-$k$ operations, which perform poorly on GPUs. We propose SiftAttention, a novel approximate attention method that replaces the top-$k$ step with a computationally efficient element-wise filtering operation based on a threshold value. Our intuition for doing this is based on our empirical observation that the $\tau$-th quantile of attention scores follows a predictable power-law over sequential generation steps. Exploiting this insight, our approach dynamically estimates a threshold value per prompt at each generation step. Only attention scores above this threshold and their corresponding value vectors are loaded/used to compute the attention output, reducing data movement between HBM and SRAM. Our evaluation demonstrates that SiftAttention preserves model quality better than existing approximate attention methods while reducing memory bandwidth usage when loading value vectors.

Title: SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05301
Pdf URL: https://arxiv.org/pdf/2506.05301
Copy Paste: [[2506.05301]] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training(https://arxiv.org/abs/2506.05301)
Keywords: diffusion
Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

Title: Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05302
Pdf URL: https://arxiv.org/pdf/2506.05302
Copy Paste: [[2506.05302]] Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos(https://arxiv.org/abs/2506.05302)
Keywords: robust, large language model, segmentation
Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definition, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed for lightweightness and efficiency, while also demonstrates strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

Title: ProRefine: Inference-time Prompt Refinement with Textual Feedback

Authors: Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Christopher M. Homan, Wei Wei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05305
Pdf URL: https://arxiv.org/pdf/2506.05305
Copy Paste: [[2506.05305]] ProRefine: Inference-time Prompt Refinement with Textual Feedback(https://arxiv.org/abs/2506.05305)
Keywords: large language model
Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.

Title: Learning normalized image densities via dual score matching

Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05310
Pdf URL: https://arxiv.org/pdf/2506.05310
Copy Paste: [[2506.05310]] Learning normalized image densities via dual score matching(https://arxiv.org/abs/2506.05310)
Keywords: diffusion, generative
Abstract: Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired from diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: estimated log probabilities are nearly independent of the specific images in the training set. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.

Title: Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Authors: Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05314
Pdf URL: https://arxiv.org/pdf/2506.05314
Copy Paste: [[2506.05314]] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models(https://arxiv.org/abs/2506.05314)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.

Title: Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Authors: Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05316
Pdf URL: https://arxiv.org/pdf/2506.05316
Copy Paste: [[2506.05316]] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay(https://arxiv.org/abs/2506.05316)
Keywords: large language model
Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.

Title: LSM-2: Learning from Incomplete Wearable Sensor Data

Authors: Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, Daniel McDuff
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05321
Pdf URL: https://arxiv.org/pdf/2506.05321
Copy Paste: [[2506.05321]] LSM-2: Learning from Incomplete Wearable Sensor Data(https://arxiv.org/abs/2506.05321)
Keywords: robust, generative
Abstract: Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM's core novelty lies in its use of learnable mask tokens to model both existing ("inherited") and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications.

Title: Seeing the Invisible: Machine learning-Based QPI Kernel Extraction via Latent Alignment

Authors: Yingshuai Ji, Haomin Zhuang, Matthew Toole, James McKenzie, Xiaolong Liu, Xiangliang Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05325
Pdf URL: https://arxiv.org/pdf/2506.05325
Copy Paste: [[2506.05325]] Seeing the Invisible: Machine learning-Based QPI Kernel Extraction via Latent Alignment(https://arxiv.org/abs/2506.05325)
Keywords: robust, extraction
Abstract: Quasiparticle interference (QPI) imaging is a powerful tool for probing electronic structures in quantum materials, but extracting the single-scatterer QPI pattern (i.e., the kernel) from a multi-scatterer image remains a fundamentally ill-posed inverse problem. In this work, we propose the first AI-based framework for QPI kernel extraction. We introduce a two-step learning strategy that decouples kernel representation learning from observation-to-kernel inference. In the first step, we train a variational autoencoder to learn a compact latent space of scattering kernels. In the second step, we align the latent representation of QPI observations with those of the pre-learned kernels using a dedicated encoder. This design enables the model to infer kernels robustly even under complex, entangled scattering conditions. We construct a diverse and physically realistic QPI dataset comprising 100 unique kernels and evaluate our method against a direct one-step baseline. Experimental results demonstrate that our approach achieves significantly higher extraction accuracy, and improved generalization to unseen kernels.

Title: Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Authors: Duochao Shi, Weijie Wang, Donny Y. Chen, Zeyu Zhang, Jia-Wang Bian, Bohan Zhuang, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05327
Pdf URL: https://arxiv.org/pdf/2506.05327
Copy Paste: [[2506.05327]] Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting(https://arxiv.org/abs/2506.05327)
Keywords: transformer
Abstract: Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality -- a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: this https URL

Title: MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Authors: Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05331
Pdf URL: https://arxiv.org/pdf/2506.05331
Copy Paste: [[2506.05331]] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning(https://arxiv.org/abs/2506.05331)
Keywords: large language model
Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at this https URL

Title: Search Arena: Analyzing Search-Augmented LLMs

Authors: Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05334
Pdf URL: https://arxiv.org/pdf/2506.05334
Copy Paste: [[2506.05334]] Search Arena: Analyzing Search-Augmented LLMs(https://arxiv.org/abs/2506.05334)
Keywords: large language model
Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: this https URL.

Title: VideoMolmo: Spatio-Temporal Grounding Meets Pointing

Authors: Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Ranjay Krishna, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05336
Pdf URL: https://arxiv.org/pdf/2506.05336
Copy Paste: [[2506.05336]] VideoMolmo: Spatio-Temporal Grounding Meets Pointing(https://arxiv.org/abs/2506.05336)
Keywords: interpretability, large language model, segmentation
Abstract: Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at this https URL.

Title: Exploring Diffusion Transformer Designs via Grafting

Authors: Keshigeyan Chandrasegaran, Michael Poli, Daniel Y. Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05340
Pdf URL: https://arxiv.org/pdf/2506.05340
Copy Paste: [[2506.05340]] Exploring Diffusion Transformer Designs via Grafting(https://arxiv.org/abs/2506.05340)
Keywords: diffusion, transformer
Abstract: Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: this https URL

Title: Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Authors: Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, Bo Dai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05341
Pdf URL: https://arxiv.org/pdf/2506.05341
Copy Paste: [[2506.05341]] Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning(https://arxiv.org/abs/2506.05341)
Keywords: generative, large language model
Abstract: Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layout that sacrifice flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.

Title: Refer to Anything with Vision-Language Prompts

Authors: Shengcao Cao, Zijun Wei, Jason Kuen, Kangning Liu, Lingzhi Zhang, Jiuxiang Gu, HyunJoon Jung, Liang-Yan Gui, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05342
Pdf URL: https://arxiv.org/pdf/2506.05342
Copy Paste: [[2506.05342]] Refer to Anything with Vision-Language Prompts(https://arxiv.org/abs/2506.05342)
Keywords: segmentation
Abstract: Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: this https URL.

Title: SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Authors: Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05344
Pdf URL: https://arxiv.org/pdf/2506.05344
Copy Paste: [[2506.05344]] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs(https://arxiv.org/abs/2506.05344)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (approximately less than 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparity of visual heads for accelerating the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual, SparseMM prioritizes stress and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on efficiency test. Our project is open sourced at this https URL.

Title: Inference-Time Hyper-Scaling with KV Cache Compression

Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05345
Pdf URL: https://arxiv.org/pdf/2506.05345
Copy Paste: [[2506.05345]] Inference-Time Hyper-Scaling with KV Cache Compression(https://arxiv.org/abs/2506.05345)
Keywords: transformer
Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.

Title: Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Authors: Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
Subjects: cs.CR, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05346
Pdf URL: https://arxiv.org/pdf/2506.05346
Copy Paste: [[2506.05346]] Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets(https://arxiv.org/abs/2506.05346)
Keywords: attack, robust, large language model
Abstract: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.

Title: Contrastive Flow Matching

Authors: George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05350
Pdf URL: https://arxiv.org/pdf/2506.05350
Copy Paste: [[2506.05350]] Contrastive Flow Matching(https://arxiv.org/abs/2506.05350)
Keywords: diffusion
Abstract: Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: this https URL.