2025-06-13

Title: Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Authors: Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj
Subjects: cs.CV, cs.AI, cs.CL, cs.GR, cs.MM
Abstract URL: https://arxiv.org/abs/2506.10005
Pdf URL: https://arxiv.org/pdf/2506.10005
Copy Paste: [[2506.10005]] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models(https://arxiv.org/abs/2506.10005)
Keywords: diffusion, generative
Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
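
The linear frame interpolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a frame is a flat list of pixel intensities and that in-between frames are plain weighted averages of two keyframes; the function name `interpolate_frames` and this data layout are hypothetical and are not taken from the paper.

```python
def interpolate_frames(frame_a, frame_b, n_intermediate):
    """Return n_intermediate frames blended linearly between two keyframes.

    Frames are flat lists of pixel intensities (an assumption made for
    illustration; a real pipeline would use image arrays).
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    frames = []
    for i in range(1, n_intermediate + 1):
        # Blend weight strictly between 0 and 1, evenly spaced.
        t = i / (n_intermediate + 1)
        frames.append([(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)])
    return frames


# Three in-between frames for a two-pixel "image".
for frame in interpolate_frames([0.0, 100.0], [100.0, 0.0], 3):
    print(frame)
```

At the stated 15-30 FPS, a 60-second video needs 900 to 1800 frames, so generating a handful of keyframes per scene and interpolating between them is far cheaper than synthesizing every frame with a diffusion model.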

Title: NOCL: Node-Oriented Conceptualization LLM for Graph Tasks without Message Passing

Authors: Wei Li, Mengcheng Lan, Jiaxing Xu, Yiping Ke
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10014
Pdf URL: https://arxiv.org/pdf/2506.10014
Copy Paste: [[2506.10014]] NOCL: Node-Oriented Conceptualization LLM for Graph Tasks without Message Passing(https://arxiv.org/abs/2506.10014)
Keywords: large language model
Abstract: Graphs are essential for modeling complex interactions across domains such as social networks, biology, and recommendation systems. Traditional Graph Neural Networks, particularly Message Passing Neural Networks (MPNNs), rely heavily on supervised learning, limiting their generalization and applicability in label-scarce scenarios. Recent self-supervised approaches still require labeled fine-tuning, limiting their effectiveness in zero-shot scenarios. Meanwhile, Large Language Models (LLMs) excel in natural language tasks but face significant challenges when applied to graphs, including preserving reasoning abilities, managing extensive token lengths from rich node attributes, and being limited to textual-attributed graphs (TAGs) and single-level tasks. To overcome these limitations, we propose the Node-Oriented Conceptualization LLM (NOCL), a novel framework that leverages two core techniques: 1) node description, which converts heterogeneous node attributes into structured natural language, extending LLMs from TAGs to non-TAGs; 2) node concept, which encodes node descriptions into compact semantic embeddings using pretrained language models, reducing token lengths by up to 93.9% compared to directly using node descriptions. Additionally, NOCL employs graph representation descriptors to unify graph tasks at various levels into a shared, language-based query format, paving a new direction for Graph Foundation Models. Experimental results validate NOCL's competitive supervised performance relative to traditional MPNNs and hybrid LLM-MPNN methods and demonstrate superior generalization in zero-shot settings.

Title: A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations

Authors: Tian Lan, Yang-Hao Zhou, Zi-Ao Ma, Fanshu Sun, Rui-Qing Sun, Junyu Luo, Rong-Cheng Tu, Heyan Huang, Chen Xu, Zhijing Wu, Xian-Ling Mao
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10019
Pdf URL: https://arxiv.org/pdf/2506.10019
Copy Paste: [[2506.10019]] A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations(https://arxiv.org/abs/2506.10019)
Keywords: generative
Abstract: Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.

Title: From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

Authors: Kyubyung Chae, Hyunbin Jin, Taesup Kim
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10020
Pdf URL: https://arxiv.org/pdf/2506.10020
Copy Paste: [[2506.10020]] From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment(https://arxiv.org/abs/2506.10020)
Keywords: attack, robust, large language model
Abstract: Safely aligning large language models (LLMs) often demands extensive human-labeled preference data, a process that's both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address this, we introduce Refusal-Aware Adaptive Injection (RAAI), a straightforward, training-free, and model-agnostic framework that repurposes LLM attack techniques. RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions. Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average across four benchmarks. Crucially, fine-tuning LLMs with the synthetic data generated by RAAI improves model robustness against harmful prompts while preserving general capabilities on standard tasks like MMLU and ARC. This work highlights how LLM attack methodologies can be reframed as practical tools for scalable and controllable safety alignment.

Title: LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

Authors: Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10022
Pdf URL: https://arxiv.org/pdf/2506.10022
Copy Paste: [[2506.10022]] LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges(https://arxiv.org/abs/2506.10022)
Keywords: security, attack, robust, large language model
Abstract: The widespread adoption of Large Language Models (LLMs) has heightened concerns about their security, particularly their vulnerability to jailbreak attacks that leverage crafted prompts to generate malicious outputs. While prior research has been conducted on general security capabilities of LLMs, their specific susceptibility to jailbreak attacks in code generation remains largely unexplored. To fill this gap, we propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation, designed to evaluate LLM robustness against such threats. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities: specifically, the average rejection rate for malicious content is 60.93%, dropping to 39.92% when combined with jailbreak attack algorithms. Our work highlights that the code security capabilities of LLMs still pose significant challenges.
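
The figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes the 3,520 prompts arise from pairing each of the 320 requirements with each of the 11 jailbreak methods; the abstract does not state this construction explicitly, but the numbers are consistent with it.

```python
# Benchmark size: requirements crossed with jailbreak methods (assumed pairing).
requirements = 320
jailbreak_methods = 11
total_prompts = requirements * jailbreak_methods
print(total_prompts)  # 3520

# Rejection rates reported in the abstract, in percent.
rejection_direct = 60.93
rejection_with_jailbreak = 39.92
drop_points = round(rejection_direct - rejection_with_jailbreak, 2)
print(drop_points)  # 21.01 percentage points
```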

Title: Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models

Authors: Elena Sofia Ruzzetti, Giancarlo A. Xompero, Davide Venditti, Fabio Massimo Zanzotto
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10024
Pdf URL: https://arxiv.org/pdf/2506.10024
Copy Paste: [[2506.10024]] Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models(https://arxiv.org/abs/2506.10024)
Keywords: privacy, defense, attack, robust, extraction, large language model
Abstract: Large Language Models (LLMs) memorize, and thus, among huge amounts of uncontrolled data, may memorize Personally Identifiable Information (PII), which should not be stored and, consequently, not leaked. In this paper, we introduce Private Memorization Editing (PME), an approach for preventing private data leakage that turns an apparent limitation, that is, the LLMs' memorization ability, into a powerful privacy defense strategy. While attacks against LLMs have been performed exploiting previous knowledge regarding their training data, our approach aims to exploit the same kind of knowledge in order to make a model more robust. We detect memorized PII and then mitigate its memorization by editing the model's knowledge of its training data. We verify that our procedure does not affect the underlying language model while making it more robust against privacy Training Data Extraction attacks. We demonstrate that PME can effectively reduce the number of leaked PII in a number of configurations, in some cases even reducing the accuracy of the privacy attacks to zero.

Title: Mind the Gap: Revealing Security Barriers through Situational Awareness of Small and Medium Business Key Decision-Makers

Authors: Yuanhaur Chang, Oren Heller, Yaniv Shlomo, Iddo Bar-Noy, Ella Bokobza, Michal Grinstein-Weiss, Ning Zhang
Subjects: cs.CR, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2506.10025
Pdf URL: https://arxiv.org/pdf/2506.10025
Copy Paste: [[2506.10025]] Mind the Gap: Revealing Security Barriers through Situational Awareness of Small and Medium Business Key Decision-Makers(https://arxiv.org/abs/2506.10025)
Keywords: security, defense
Abstract: Key decision-makers in small and medium businesses (SMBs) often lack the awareness and knowledge to implement cybersecurity measures effectively. To gain a deeper understanding of how SMB executives navigate cybersecurity decision-making, we deployed a mixed-method approach, conducting semi-structured interviews (n=21) and online surveys (n=322) with SMB key decision-makers. Using thematic analysis, we revealed SMB decision-makers' perceived risks in terms of the digital assets they valued, and found reasons for their choice of defense measures and factors impacting security perception. We employed the situational awareness model to characterize decision-makers based on cybersecurity awareness, identifying those who have comparatively low awareness in the fight against adversaries. We further explored the relationship between awareness and business attributes, and constructed a holistic structural equation model to understand how awareness can be improved. Finally, we proposed interventions to help SMBs overcome potential challenges.

Title: Secure Data Access in Cloud Environments Using Quantum Cryptography

Authors: S. Vasavi Venkata Lakshmi, Ziaul Haque Choudhury
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10028
Pdf URL: https://arxiv.org/pdf/2506.10028
Copy Paste: [[2506.10028]] Secure Data Access in Cloud Environments Using Quantum Cryptography(https://arxiv.org/abs/2506.10028)
Keywords: secure, security, protect, defense
Abstract: Cloud computing has made storing and accessing data easier, but keeping that data secure remains a major challenge. Traditional methods of securing data may not be strong enough once powerful quantum computers become available. To address this problem, this study uses quantum cryptography to protect data in the cloud environment. Quantum Key Distribution (QKD) creates secure keys by sending information using quantum particles such as photons. Specifically, we use the BB84 protocol, a simple and reliable way to establish secure keys that cannot be stolen without detection. To protect the data, we use the Quantum One-Time Pad (QOTP) for encryption and decryption, ensuring the data stays completely private. This study shows how these quantum methods can be applied in cloud systems to provide a strong defense against hackers, even if they have access to quantum computers. The combination of QKD, BB84, and QOTP creates a safe and reliable way to keep data secure when it is stored or shared in the cloud. Using quantum cryptography, this paper provides a way to ensure data security now and in the future, making cloud storage safer for everyone.
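
The basis-sifting step of BB84 can be sketched as a purely classical simulation. The sketch assumes an ideal, noiseless channel with no eavesdropper, and it models a wrong-basis measurement as a fair coin flip; the function name `bb84_sift` and the basis labels `+` and `x` are illustrative choices, not details from the paper.

```python
import random


def bb84_sift(n_bits, seed=None):
    """Simulate BB84 key sifting on an ideal channel (no noise, no eavesdropper)."""
    rng = random.Random(seed)
    # Alice picks random bits and random encoding bases.
    alice_bits = [rng.randint(0, 1) for _ in range(n_bits)]
    alice_bases = [rng.choice("+x") for _ in range(n_bits)]
    # Bob picks his measurement bases independently.
    bob_bases = [rng.choice("+x") for _ in range(n_bits)]

    bob_bits = []
    for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases):
        if a_basis == b_basis:
            bob_bits.append(bit)  # matching basis: Bob recovers Alice's bit
        else:
            bob_bits.append(rng.randint(0, 1))  # mismatched basis: random result

    # Bases are compared publicly; only matching positions are kept.
    keep = [i for i in range(n_bits) if alice_bases[i] == bob_bases[i]]
    alice_key = [alice_bits[i] for i in keep]
    bob_key = [bob_bits[i] for i in keep]
    return alice_key, bob_key


alice_key, bob_key = bb84_sift(32, seed=7)
print(len(alice_key), alice_key == bob_key)
```

On average about half of the transmitted positions survive sifting. In the full protocol, the two parties would also compare a sample of the sifted key to estimate the error rate, which is how interception is detected; that step is omitted here.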

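Title: Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks [Empirical Evaluation of the Security and Alignment of ChatGPT and Gemini: A Comparative Analysis of Vulnerabilities through Jailbreak Experiments]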

Authors: Rafaël Nouailles (GdR)
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10029
Pdf URL: https://arxiv.org/pdf/2506.10029
Copy Paste: [[2506.10029]] Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks(https://arxiv.org/abs/2506.10029)
Keywords: security, attack, large language model
Abstract: Large Language Models (LLMs) are transforming digital usage, particularly in text generation, image creation, information retrieval and code development. ChatGPT, launched by OpenAI in November 2022, quickly became a reference, prompting the emergence of competitors such as Google's Gemini. However, these technological advances raise new cybersecurity challenges, including prompt injection attacks, the circumvention of regulatory measures (jailbreaking), the spread of misinformation (hallucinations) and risks associated with deep fakes. This paper presents a comparative analysis of the security and alignment levels of ChatGPT and Gemini, as well as a taxonomy of jailbreak techniques associated with experiments.

Title: Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment

Authors: Tianyu Chen, Jian Lou, Wenjie Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10030
Pdf URL: https://arxiv.org/pdf/2506.10030
Copy Paste: [[2506.10030]] Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment(https://arxiv.org/abs/2506.10030)
Keywords: protect, robust, steal, watermark
Abstract: As Retrieval-Augmented Generation (RAG) evolves into service-oriented platforms (RAG-as-a-Service) with shared knowledge bases, protecting the copyright of contributed data becomes essential. Existing watermarking methods in RAG focus solely on textual knowledge, leaving image knowledge unprotected. In this work, we propose AQUA, the first watermark framework for image knowledge protection in Multimodal RAG systems. AQUA embeds semantic signals into synthetic images using two complementary methods: acronym-based triggers and spatial relationship cues. These techniques ensure that watermark signals survive indirect propagation from the image retriever to the textual generator while remaining efficient, effective, and imperceptible. Experiments across diverse models and datasets show that AQUA enables robust, stealthy, and reliable copyright tracing, filling a key gap in multimodal RAG protection.

Title: Multiverse Privacy Theory for Contextual Risks in Complex User-AI Interactions

Authors: Ece Gumusel
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2506.10042
Pdf URL: https://arxiv.org/pdf/2506.10042
Copy Paste: [[2506.10042]] Multiverse Privacy Theory for Contextual Risks in Complex User-AI Interactions(https://arxiv.org/abs/2506.10042)
Keywords: privacy
Abstract: In an era of increasing interaction with artificial intelligence (AI), users face evolving privacy decisions shaped by complex, uncertain factors. This paper introduces Multiverse Privacy Theory, a novel framework in which each privacy decision spawns a parallel universe, representing a distinct potential outcome based on user choices over time. By simulating these universes, this theory provides a foundation for understanding privacy through the lens of contextual integrity, evolving preferences, and probabilistic decision-making. Future work will explore its application using real-world, scenario-based survey data.

Title: GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

Authors: Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10047
Pdf URL: https://arxiv.org/pdf/2506.10047
Copy Paste: [[2506.10047]] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models(https://arxiv.org/abs/2506.10047)
Keywords: attack, diffusion, large language model
Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.

Title: Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs

Authors: Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10054
Pdf URL: https://arxiv.org/pdf/2506.10054
Copy Paste: [[2506.10054]] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs(https://arxiv.org/abs/2506.10054)
Keywords: robust
Abstract: Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be made available.
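
For orientation, the standard per-pair DPO loss that such methods build on can be written in a few lines. The sketch below implements that standard loss and adds a multiplicative `weight` argument purely as a placeholder for per-sample weighting; how Omni-DPO actually computes its weights is not described in the abstract, so the `weight` parameter and the function name `weighted_dpo_loss` are assumptions for illustration.

```python
import math


def weighted_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                      ref_chosen_logp, ref_rejected_logp,
                      beta=0.1, weight=1.0):
    """Standard DPO loss for one preference pair, scaled by a sample weight.

    `weight` is a hypothetical stand-in for adaptive per-pair weighting.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return weight * loss


# A pair the policy already ranks correctly yields a lower loss
# than one it ranks incorrectly.
print(weighted_dpo_loss(-10.0, -20.0, -15.0, -15.0))
print(weighted_dpo_loss(-20.0, -10.0, -15.0, -15.0))
```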

Title: Textual Bayes: Quantifying Uncertainty in LLM-Based Systems

Authors: Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, Jesse C. Cresswell
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.10060
Pdf URL: https://arxiv.org/pdf/2506.10060
Copy Paste: [[2506.10060]] Textual Bayes: Quantifying Uncertainty in LLM-Based Systems(https://arxiv.org/abs/2506.10060)
Keywords: large language model
Abstract: Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem, which limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference, a difficult problem even for well-studied data modalities, we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.
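
The generic Metropolis-Hastings step that MHLP builds on can be sketched over a discrete set of candidate prompts. This is the textbook algorithm, not the paper's method: the callables `log_target`, `propose`, and `log_q`, along with the toy prompts and scores, are hypothetical stand-ins (in MHLP the proposals would come from an LLM, per the abstract).

```python
import math
import random


def mh_accept_prob(log_target_current, log_target_proposed,
                   log_q_forward, log_q_backward):
    """Metropolis-Hastings acceptance probability, computed in log space."""
    log_ratio = ((log_target_proposed + log_q_backward)
                 - (log_target_current + log_q_forward))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def metropolis_hastings(log_target, propose, log_q, init, n_steps, seed=0):
    """Run a Metropolis-Hastings chain; log_q(x, y) is log P(propose y | x)."""
    rng = random.Random(seed)
    state = init
    samples = []
    for _ in range(n_steps):
        candidate = propose(state, rng)
        accept = mh_accept_prob(log_target(state), log_target(candidate),
                                log_q(state, candidate), log_q(candidate, state))
        if rng.random() < accept:
            state = candidate
        samples.append(state)
    return samples


# Toy usage: three candidate prompts with made-up unnormalized log-posteriors.
scores = {"Answer briefly.": -2.0, "Think step by step.": -0.5, "Cite sources.": -1.0}
prompts = list(scores)
chain = metropolis_hastings(
    log_target=lambda p: scores[p],
    propose=lambda p, rng: rng.choice(prompts),   # uniform, hence symmetric
    log_q=lambda x, y: math.log(1 / len(prompts)),
    init=prompts[0],
    n_steps=1000,
)
print({p: chain.count(p) for p in prompts})
```

The resulting chain visits each prompt roughly in proportion to its posterior weight, which is what allows uncertainty over prompts to be carried through to downstream predictions.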

Title: A quantum semantic framework for natural language processing

Authors: Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar
Subjects: cs.CL, cs.AI, cs.IR, cs.IT
Abstract URL: https://arxiv.org/abs/2506.10077
Pdf URL: https://arxiv.org/pdf/2506.10077
Copy Paste: [[2506.10077]] A quantum semantic framework for natural language processing(https://arxiv.org/abs/2506.10077)
Keywords: large language model
Abstract: Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. Large Language Models (LLMs) and other modern NLP systems face inherent limitations precisely because they operate within natural language itself, making them subject to the same interpretive constraints imposed by semantic degeneracy. In this work, we argue using Kolmogorov complexity that as an expression's complexity grows, the likelihood of any interpreting agent (human or LLM-powered AI) recovering the single intended meaning vanishes. This computational intractability suggests the classical view that linguistic forms possess meaning in and of themselves is flawed. We alternatively posit that meaning is instead actualized through an observer-dependent interpretive act. To test this, we conducted a semantic Bell inequality test using diverse LLM agents as "computational cognitive systems" to interpret ambiguous word pairs under varied contextual settings. Across several independent experiments, we found average CHSH expectation values ranging from 1.2 to 2.8, with several runs yielding values (e.g., 2.3-2.4) that significantly violate the classical boundary ($|S| \leq 2$). This demonstrates that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
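
The CHSH quantity behind the classical bound can be computed as follows. The sketch uses one common sign convention, S = E(a,b) - E(a,b') + E(a',b) + E(a',b'), with outcomes coded as +1 or -1; the paper's exact convention and its mapping from LLM interpretations to outcomes are not given in the abstract, so both are assumptions here.

```python
import math


def correlation(outcome_pairs):
    """Average product of paired outcomes, each outcome being +1 or -1."""
    return sum(x * y for x, y in outcome_pairs) / len(outcome_pairs)


def chsh(e_ab, e_ab_prime, e_a_prime_b, e_a_prime_b_prime):
    """CHSH statistic under one common sign convention (an assumption)."""
    return e_ab - e_ab_prime + e_a_prime_b + e_a_prime_b_prime


# A deterministic classical strategy saturates but never exceeds |S| = 2.
print(chsh(1.0, 1.0, 1.0, 1.0))  # 2.0

# Correlations of magnitude 1/sqrt(2) reach the quantum maximum 2*sqrt(2).
r = 1 / math.sqrt(2)
print(chsh(r, -r, r, r))
```

Values of |S| between 2 and roughly 2.83, such as the 2.3-2.4 reported above, therefore sit in the range that classical local models cannot produce.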

Title: LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

Authors: Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10082
Pdf URL: https://arxiv.org/pdf/2506.10082
Copy Paste: [[2506.10082]] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning(https://arxiv.org/abs/2506.10082)
Keywords: diffusion
Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edit propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pretrained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.

Title: DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding

Authors: Bin Guo, John H.L. Hansen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10084
Pdf URL: https://arxiv.org/pdf/2506.10084
Copy Paste: [[2506.10084]] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding(https://arxiv.org/abs/2506.10084)
Keywords: robust
Abstract: Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.

Title: Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Authors: Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10086
Pdf URL: https://arxiv.org/pdf/2506.10086
Copy Paste: [[2506.10086]] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information(https://arxiv.org/abs/2506.10086)
Keywords: large language model
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.

Title: Optimizing Latent Dimension Allocation in Hierarchical VAEs: Balancing Attenuation and Information Retention for OOD Detection

Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10089
Pdf URL: https://arxiv.org/pdf/2506.10089
Copy Paste: [[2506.10089]] Optimizing Latent Dimension Allocation in Hierarchical VAEs: Balancing Attenuation and Information Retention for OOD Detection(https://arxiv.org/abs/2506.10089)
Keywords: robust, generative
Abstract: Out-of-distribution (OOD) detection is a critical task in machine learning, particularly for safety-critical applications where unexpected inputs must be reliably flagged. While hierarchical variational autoencoders (HVAEs) offer improved representational capacity over traditional VAEs, their performance is highly sensitive to how latent dimensions are distributed across layers. Existing approaches often allocate latent capacity arbitrarily, leading to ineffective representations or posterior collapse. In this work, we introduce a theoretically grounded framework for optimizing latent dimension allocation in HVAEs, drawing on principles from information theory to formalize the trade-off between information loss and representational attenuation. We prove the existence of an optimal allocation ratio $r^{\ast}$ under a fixed latent budget, and empirically show that tuning this ratio consistently improves OOD detection performance across datasets and architectures. Our approach outperforms baseline HVAE configurations and provides practical guidance for principled latent structure design, leading to more robust OOD detection with deep generative models.

Title: When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs

Authors: Xiao Li, Joel Kreuzwieser, Alan Peters
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10095
Pdf URL: https://arxiv.org/pdf/2506.10095
Copy Paste: [[2506.10095]] When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs(https://arxiv.org/abs/2506.10095)
Keywords: large language model
Abstract: We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation, stability under rephrasing, and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.

Title: EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

Authors: Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10100
Pdf URL: https://arxiv.org/pdf/2506.10100
Copy Paste: [[2506.10100]] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models(https://arxiv.org/abs/2506.10100)
Keywords: diffusion
Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to a standard VLA model, CogACT, yielding a 1.93X inference speedup and reducing FLOPs to 28.9%, with only a 0.6% success rate drop in the SIMPLER benchmark.

Title: Learning to Collaborate Over Graphs: A Selective Federated Multi-Task Learning Approach

Authors: Ahmed Elbakary, Chaouki Ben Issaid, Mehdi Bennis
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2506.10102
Pdf URL: https://arxiv.org/pdf/2506.10102
Copy Paste: [[2506.10102]] Learning to Collaborate Over Graphs: A Selective Federated Multi-Task Learning Approach(https://arxiv.org/abs/2506.10102)
Keywords: federate, fair
Abstract: We present a novel federated multi-task learning method that leverages cross-client similarity to enable personalized learning for each client. To avoid transmitting the entire model to the parameter server, we propose a communication-efficient scheme that introduces a feature anchor, a compact vector representation that summarizes the features learned from the client's local classes. This feature anchor is shared with the server to account for local clients' distribution. In addition, the clients share the classification heads, a lightweight linear layer, and perform a graph-based regularization to enable collaboration among clients. By modeling collaboration between clients as a dynamic graph and continuously updating and refining this graph, we can account for any drift from the clients. To ensure beneficial knowledge transfer and prevent negative collaboration, we leverage a community detection-based approach that partitions this dynamic graph into homogeneous communities, maximizing the sum of task similarities, represented as the graph edges' weights, within each community. This mechanism restricts collaboration to highly similar clients within their formed communities, ensuring positive interaction and preserving personalization. Extensive experiments on two heterogeneous datasets demonstrate that our method significantly outperforms state-of-the-art baselines. Furthermore, we show that our method exhibits superior computation and communication efficiency and promotes fairness across clients.

Title: Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection

Authors: David Farr, Kevin Talty, Alexandra Farr, John Stockdale, Iain Cruickshank, Jevin West
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.10104
Pdf URL: https://arxiv.org/pdf/2506.10104
Copy Paste: [[2506.10104]] Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection(https://arxiv.org/abs/2506.10104)
Keywords: secure, security, robust, interpretability, large language model
Abstract: As cyber threats become more sophisticated, rapid and accurate vulnerability detection is essential for maintaining secure systems. This study explores the use of Large Language Models (LLMs) in software vulnerability assessment by simulating the identification of Python code with known Common Weakness Enumerations (CWEs), comparing zero-shot, few-shot cross-domain, and few-shot in-domain prompting strategies. Our results indicate that while zero-shot prompting performs poorly, few-shot prompting significantly enhances classification performance, particularly when integrated with confidence-based routing strategies that improve efficiency by directing human experts to cases where model uncertainty is high, optimizing the balance between automation and expert oversight. We find that LLMs can effectively generalize across vulnerability categories with minimal examples, suggesting their potential as scalable, adaptable cybersecurity tools in simulated environments. However, challenges such as model reliability, interpretability, and adversarial robustness remain critical areas for future research. By integrating AI-driven approaches with expert-in-the-loop (EITL) decision-making, this work highlights a pathway toward more efficient and responsive cybersecurity workflows. Our findings provide a foundation for deploying AI-assisted vulnerability detection systems in both real and simulated environments that enhance operational resilience while reducing the burden on human analysts.
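The confidence-based routing described in this abstract can be pictured with a minimal sketch. The abstract does not give the routing rule, so the `Prediction` type, the `route` function, and the 0.8 threshold below are all hypothetical stand-ins rather than the authors' implementation: low-confidence model outputs are escalated to a human expert and the rest are accepted automatically.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # predicted CWE label, e.g. "CWE-89" (illustrative)
    confidence: float  # model-reported confidence in [0, 1] (assumed available)


def route(pred: Prediction, threshold: float = 0.8) -> str:
    """Return "auto" to accept the model label, or "expert" to escalate.

    The threshold value is an assumption chosen for illustration only.
    """
    return "auto" if pred.confidence >= threshold else "expert"


# Two toy predictions: one confident, one uncertain.
preds = [Prediction("CWE-89", 0.95), Prediction("CWE-79", 0.55)]
decisions = [route(p) for p in preds]
```

Raising the threshold sends more cases to experts and lowers automation, which is the trade-off the abstract refers to.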

Title: NnD: Diffusion-based Generation of Physically-Nonnegative Objects

Authors: Nadav Torem, Tamar Sde-Chen, Yoav Y. Schechner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10112
Pdf URL: https://arxiv.org/pdf/2506.10112
Copy Paste: [[2506.10112]] NnD: Diffusion-based Generation of Physically-Nonnegative Objects(https://arxiv.org/abs/2506.10112)
Keywords: diffusion, generative
Abstract: Most natural objects have inherent complexity and variability. While some simple objects can be modeled from first principles, many real-world phenomena, such as cloud formation, require computationally expensive simulations that limit scalability. This work focuses on a class of physically meaningful, nonnegative objects that are computationally tractable but costly to simulate. To dramatically reduce computational costs, we propose nonnegative diffusion (NnD). This is a learned generative model using score-based diffusion. It adapts annealed Langevin dynamics to enforce, by design, non-negativity throughout iterative scene generation and analysis (inference). NnD trains on high-quality physically simulated objects. Once trained, it can be used for generation and inference. We demonstrate generation of 3D volumetric clouds, comprising inherently nonnegative microphysical fields. Our generated clouds are consistent with cloud physics trends. Experts effectively do not perceive them as non-physical.

Title: ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering

Authors: Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang, Bihui Yu, Junnan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10116
Pdf URL: https://arxiv.org/pdf/2506.10116
Copy Paste: [[2506.10116]] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering(https://arxiv.org/abs/2506.10116)
Keywords: large language model
Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.

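Title: Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers (Psoriasis Detection Using Computer Vision: A Comparative Approach Between CNNs and Vision Transformers)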

Authors: Natanael Lucena, Fábio S. da Silva, Ricardo Rios
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10119
Pdf URL: https://arxiv.org/pdf/2506.10119
Copy Paste: [[2506.10119]] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers(https://arxiv.org/abs/2506.10119)
Keywords: transformer
Abstract: This paper compares the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on the multi-class classification of images containing lesions of psoriasis and similar diseases. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an F1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.

Title: GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments

Authors: Maryam Khalid, Akane Sano
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2506.10120
Pdf URL: https://arxiv.org/pdf/2506.10120
Copy Paste: [[2506.10120]] GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments(https://arxiv.org/abs/2506.10120)
Keywords: fair
Abstract: Graph-based Active Learning (AL) leverages the structure of graphs to efficiently prioritize label queries, reducing labeling costs and user burden in applications like health monitoring, human behavior analysis, and sensor networks. By identifying strategically positioned nodes, graph AL minimizes data collection demands while maintaining model performance, making it a valuable tool for dynamic environments. Despite its potential, existing graph AL methods are often evaluated on static graph datasets and primarily focus on prediction accuracy, neglecting user-centric considerations such as sampling diversity, query fairness, and adaptability to dynamic settings. To bridge this gap, we introduce GRAIL, a novel benchmarking framework designed to evaluate graph AL strategies in dynamic, real-world environments. GRAIL introduces novel metrics to assess sustained effectiveness, diversity, and user burden, enabling a comprehensive evaluation of AL methods under varying conditions. Extensive experiments on datasets featuring dynamic, real-life human sensor data reveal trade-offs between prediction performance and user burden, highlighting limitations in existing AL strategies. GRAIL demonstrates the importance of balancing node importance, query diversity, and network topology, providing an evaluation mechanism for graph AL solutions in dynamic environments.

Title: D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning

Authors: Muqi Zou, Hongyu Cai, Hongwei Wu, Zion Leonahenahe Basque, Arslan Khan, Berkay Celik, Dave (Jing) Tian, Antonio Bianchi, Ruoyu (Fish) Wang, Dongyan Xu
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.10125
Pdf URL: https://arxiv.org/pdf/2506.10125
Copy Paste: [[2506.10125]] D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning(https://arxiv.org/abs/2506.10125)
Keywords: security, large language model
Abstract: Decompilers, which reconstruct human-readable source code from binary executables, are vital to many security tasks. Yet, despite recent advances, their output often suffers from syntactic and semantic errors and remains difficult to read. Recently, with the advent of large language models (LLMs), researchers began to explore the potential of LLMs to refine decompiler output. Nevertheless, our study of these approaches reveals significant limitations, such as introducing new errors and relying on unreliable accuracy validation. In this paper, we present D-LiFT, an automated decompiler backend that harnesses and further trains LLMs to improve the quality of decompiled code via reinforcement learning (RL). Unlike prior work that overlooks preserving accuracy, D-LiFT adheres to a key principle for enhancing the quality of decompiled code: preserving accuracy while improving readability. Central to D-LiFT, we propose D-SCORE, an integrated quality assessment system to score the decompiled code from multiple aspects. In line with our principle, D-SCORE assigns low scores to any inaccurate output and only awards higher scores for readability to code that passes the accuracy check. Specifically, D-SCORE first verifies the syntactic and semantic correctness via the compiler and symbolic execution; only if a candidate is deemed accurate does it then evaluate readability, using established metrics to compare the LLM output with the original decompiled code. The score is then fed back to the LLM for fine-tuning. Our implementation, based on Ghidra and a range of LLMs, demonstrates significant improvements for the accurate decompiled code from the coreutils and util-linux projects. Compared to baseline LLMs without D-SCORE-driven fine-tuning, D-LiFT produces 55.3% more improved decompiled functions, as measured by D-SCORE.
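The accuracy-first gating that the abstract attributes to D-SCORE can be sketched roughly as follows. This is an illustration only: the function name `gated_score`, the score values (0.0 for inaccurate output, a base of 1.0 plus a readability gain otherwise), and the toy accuracy and readability checks are all assumptions, standing in for the compiler, symbolic-execution, and readability metrics the paper actually uses.

```python
from typing import Callable


def gated_score(
    candidate: str,
    baseline: str,
    is_accurate: Callable[[str], bool],
    readability: Callable[[str], float],
) -> float:
    """Score a refined decompilation; readability counts only if accurate."""
    if not is_accurate(candidate):
        return 0.0  # inaccurate output gets the lowest score (assumed value)
    # Reward only the readability gain over the original decompiler output.
    gain = max(0.0, readability(candidate) - readability(baseline))
    return 1.0 + gain  # base score of 1.0 for passing accuracy (assumed value)


# Toy stand-ins (hypothetical): "accurate" means the expected expression is
# present, and readability penalizes each goto.
def toy_accurate(code: str) -> bool:
    return "return a + b" in code


def toy_readability(code: str) -> float:
    return -float(code.count("goto"))


baseline = "goto L1; L1: return a + b;"
good = gated_score("return a + b;", baseline, toy_accurate, toy_readability)
bad = gated_score("return a - b;", baseline, toy_accurate, toy_readability)
```

With these toy checks, the readable but wrong candidate scores below the accurate one regardless of how clean it looks, which is the ordering the stated principle requires.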

Title: ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10128
Pdf URL: https://arxiv.org/pdf/2506.10128
Copy Paste: [[2506.10128]] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs(https://arxiv.org/abs/2506.10128)
Keywords: large language model
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
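A binary, exact-match reward of the kind the abstract mentions could look like the sketch below. The function name and the normalization step (lowercasing and collapsing whitespace) are assumptions made for illustration; the abstract does not say how spans are compared.

```python
def exact_match_reward(predicted_span: str, corrupted_span: str) -> int:
    """Return 1 if the model pinpointed the injected span exactly, else 0.

    Case and whitespace normalization is an assumed convenience, not
    something the abstract specifies.
    """
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return int(normalize(predicted_span) == normalize(corrupted_span))


# Toy example: the caption was altered to say "two red cars".
hit = exact_match_reward("two  Red cars", "two red cars")
miss = exact_match_reward("three red cars", "two red cars")
```

Because the reward is a string comparison against a known injected span, it needs no learned judge, which is what makes the task easy to verify.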

Title: Provable Sim-to-Real Transfer via Offline Domain Randomization

Authors: Arnaud Fickinger, Abderrahim Bendahi, Stuart Russell
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.10133
Pdf URL: https://arxiv.org/pdf/2506.10133
Copy Paste: [[2506.10133]] Provable Sim-to-Real Transfer via Offline Domain Randomization(https://arxiv.org/abs/2506.10133)
Keywords: robust
Abstract: Reinforcement-learning agents often struggle when deployed from simulation to the real world. A dominant strategy for reducing the sim-to-real gap is domain randomization (DR), which trains the policy across many simulators produced by sampling dynamics parameters, but standard DR ignores offline data already available from the real system. We study offline domain randomization (ODR), which first fits a distribution over simulator parameters to an offline dataset. While a growing body of empirical work reports substantial gains with algorithms such as DROPO, the theoretical foundations of ODR remain largely unexplored. In this work, we (i) formalize ODR as a maximum-likelihood estimation over a parametric simulator family, (ii) prove consistency of this estimator under mild regularity and identifiability conditions, showing it converges to the true dynamics as the dataset grows, (iii) derive gap bounds demonstrating ODR's sim-to-real error is up to an O(M) factor tighter than uniform DR in the finite-simulator case (and analogous gains in the continuous setting), and (iv) introduce E-DROPO, a new version of DROPO which adds an entropy bonus to prevent variance collapse, yielding broader randomization and more robust zero-shot transfer in practice.

Title: Physiological-Model-Based Neural Network for Heart Rate Estimation during Daily Physical Activities

Authors: Yaowen Zhang, Libera Fresiello, Peter H. Veltink, Dirk W. Donker, Ying Wang
Subjects: cs.LG, physics.med-ph
Abstract URL: https://arxiv.org/abs/2506.10144
Pdf URL: https://arxiv.org/pdf/2506.10144
Copy Paste: [[2506.10144]] Physiological-Model-Based Neural Network for Heart Rate Estimation during Daily Physical Activities(https://arxiv.org/abs/2506.10144)
Keywords: interpretability
Abstract: Heart failure (HF) poses a significant global health challenge, with early detection offering opportunities for improved outcomes. Abnormalities in heart rate (HR), particularly during daily activities, may serve as early indicators of HF risk. However, existing HR monitoring tools for HF detection are limited by their reliance on population-based averages. The estimation of individualized HR serves as a dynamic digital twin, enabling precise tracking of cardiac health biomarkers. Current HR estimation methods, categorized into physiologically-driven and purely data-driven models, struggle with efficiency and interpretability. This study introduces a novel physiological-model-based neural network (PMB-NN) framework for HR estimation based on oxygen uptake (VO2) data during daily physical activities. The framework was trained and tested on individual datasets from 12 participants engaged in activities including resting, cycling, and running. By embedding physiological constraints, which were derived from our proposed simplified human movement physiological model (PM), into the neural network training process, the PMB-NN model adheres to human physiological principles while achieving high estimation accuracy, with a median R$^2$ score of 0.8 and an RMSE of 8.3 bpm. Comparative statistical analysis demonstrates that the PMB-NN achieves performance on par with the benchmark neural network model while significantly outperforming the traditional physiological model (p=0.002). In addition, our PMB-NN is adept at identifying personalized parameters of the PM, enabling the PM to generate reasonable HR estimates. The proposed framework, combined with a precise VO2 estimation system derived from body movements, opens future possibilities for personalized and real-time cardiac monitoring during daily-life physical activities.

Title: RoCA: Robust Cross-Domain End-to-End Autonomous Driving

Authors: Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10145
Pdf URL: https://arxiv.org/pdf/2506.10145
Copy Paste: [[2506.10145]] RoCA: Robust Cross-Domain End-to-End Autonomous Driving(https://arxiv.org/abs/2506.10145)
Keywords: robust, large language model
Abstract: End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.

Title: Unconditionally Secure Wireless-Wired Ground-Satellite-Ground Communication Networks Utilizing Classical and Quantum Noise

Authors: Lucas Truax, Sandip Roy, Laszlo B. Kish
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10147
Pdf URL: https://arxiv.org/pdf/2506.10147
Copy Paste: [[2506.10147]] Unconditionally Secure Wireless-Wired Ground-Satellite-Ground Communication Networks Utilizing Classical and Quantum Noise(https://arxiv.org/abs/2506.10147)
Keywords: secure, security, robust
Abstract: In this paper, we introduce the Kirchhoff-Law-Johnson-Noise (KLJN) scheme as an approach to securing satellite communications. KLJN has the potential to revolutionize satellite communication security through its combination of simplicity, cost-effectiveness, and resilience with unconditional security. Unlike quantum key distribution (QKD), which requires complex, fragile, and expensive infrastructure like photon detectors and dedicated optical links, KLJN operates using standard electronic components and wires, significantly reducing implementation costs and logistical hurdles. KLJN's security, grounded in the fundamental laws of classical physics, is impervious to environmental and radiation-induced noise, making it highly reliable in the harsh conditions of satellite communications. This robustness, coupled with its ability to integrate seamlessly with existing infrastructure, positions KLJN as a revolutionary alternative to quantum solutions for ensuring secure, resilient satellite communications. The authors explore the value of achieving unconditionally secure communications in strategic ground-to-satellite networks, which address vulnerabilities posed by advanced computational threats, including quantum computing. Our team has examined two leading approaches to unconditional security - the KLJN scheme and QKD - and analyzed the potential use of each for space systems. While QKD leverages quantum mechanics for security, it faces challenges related to cost, complexity, and environmental sensitivity. In contrast, the KLJN scheme utilizes classical physics principles to provide a simpler, more cost-effective, and resilient alternative, particularly for ground-based systems. The study concludes that KLJN offers significant advantages in simplicity, cost-efficiency, and robustness, making it a practical choice for many secure communication applications.

Title: When Large Language Models are Reliable for Judging Empathic Communication

Authors: Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Erina Farrell, Bruce Lambert, Matthew Groh
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.10150
Pdf URL: https://arxiv.org/pdf/2506.10150
Copy Paste: [[2506.10150]] When Large Language Models are Reliable for Judging Empathic Communication(https://arxiv.org/abs/2506.10150)
Keywords: large language model
Abstract: Large language models (LLMs) excel at generating empathic responses in text-based conversations. But how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks' sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert-level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.

Title: Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective

Authors: Yi Wang, Max Kreminski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10161
Pdf URL: https://arxiv.org/pdf/2506.10161
Copy Paste: [[2506.10161]] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective(https://arxiv.org/abs/2506.10161)
Keywords: large language model
Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs' ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs' story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.

Title: Disclosure Audits for LLM Agents

Authors: Saswat Das, Jameson Sandler, Ferdinando Fioretto
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10171
Pdf URL: https://arxiv.org/pdf/2506.10171
Copy Paste: [[2506.10171]] Disclosure Audits for LLM Agents(https://arxiv.org/abs/2506.10171)
Keywords: privacy, defense, large language model
Abstract: Large Language Model agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. This study proposes an auditing framework for conversational privacy that quantifies and audits these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework is an iterative probing strategy designed to stress-test agents that enforce strict privacy directives. Rather than focusing solely on a single disclosure event, CMPL simulates realistic multi-turn interactions to systematically uncover latent vulnerabilities. Our evaluation on diverse domains, data modalities, and safety configurations demonstrates the auditing framework's ability to reveal privacy risks that are not deterred by existing single-turn defenses. In addition to introducing CMPL as a diagnostic tool, the paper delivers (1) an auditing procedure grounded in quantifiable risk metrics and (2) an open benchmark for evaluation of conversational privacy across agent implementations.

Title: SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score

Authors: Mohammad Jalali, Haoyu Lei, Amin Gohari, Farzan Farnia
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10173
Pdf URL: https://arxiv.org/pdf/2506.10173
Copy Paste: [[2506.10173]] SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score(https://arxiv.org/abs/2506.10173)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: this https URL
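
To make the entropy score concrete, here is a minimal sketch of an unconditional order-2 RKE score, assuming the usual definition as the inverse squared Frobenius norm of the trace-normalized kernel matrix. The Gaussian kernel, the bandwidth `sigma`, and the function names are illustrative assumptions. This is not the conditional latent guidance that SPARKE proposes, and this batch form costs $O(n^2)$ in the number of samples rather than the $O(n)$ per-step cost stated in the abstract; it only shows why the order-2 case avoids an eigendecomposition.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Pairwise Gaussian kernel matrix for the rows of x (assumed kernel choice)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rke_score(x, sigma=1.0):
    """Order-2 Renyi kernel entropy score: 1 / ||K / n||_F^2.

    Ranges from 1 (all samples identical) up to n (all samples mutually
    dissimilar under the kernel), so it behaves like a soft mode count.
    """
    n = x.shape[0]
    k = gaussian_kernel(x, sigma)
    return 1.0 / np.sum((k / n) ** 2)
```

Because the score needs only the sum of squared kernel entries, it can be computed without the eigenvalues that general matrix-based entropies require.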

Title: Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context

Authors: Yael Frischholz, Devis Tuia, Michael Lehning
Subjects: cs.CV, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2506.10174
Pdf URL: https://arxiv.org/pdf/2506.10174
Copy Paste: [[2506.10174]] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context(https://arxiv.org/abs/2506.10174)
Keywords: transformer
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont's SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided with a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model's ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at this https URL

Title: AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution

Authors: Nanda Rani, Sandeep Kumar Shukla
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10175
Pdf URL: https://arxiv.org/pdf/2506.10175
Copy Paste: [[2506.10175]] AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution(https://arxiv.org/abs/2506.10175)
Keywords: attack, large language model
Abstract: Effective attribution of Advanced Persistent Threats (APTs) increasingly hinges on the ability to correlate behavioral patterns and reason over complex, varied threat intelligence artifacts. We present AURA (Attribution Using Retrieval-Augmented Agents), a multi-agent, knowledge-enhanced framework for automated and interpretable APT attribution. AURA ingests diverse threat data including Tactics, Techniques, and Procedures (TTPs), Indicators of Compromise (IoCs), malware details, adversarial tools, and temporal information, which are processed through a network of collaborative agents. These agents are designed for intelligent query rewriting, context-enriched retrieval from structured threat knowledge bases, and natural language justification of attribution decisions. By combining Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), AURA enables contextual linking of threat behaviors to known APT groups and supports traceable reasoning across multiple attack phases. Experiments on recent APT campaigns demonstrate AURA's high attribution consistency, expert-aligned justifications, and scalability. This work establishes AURA as a promising direction for advancing transparent, data-driven, and scalable threat attribution using multi-agent intelligence.

Title: Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models

Authors: Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.10177
Pdf URL: https://arxiv.org/pdf/2506.10177
Copy Paste: [[2506.10177]] Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models(https://arxiv.org/abs/2506.10177)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics: each simulated sampling trajectory lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical "boomerang" shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing ODE-based numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in regions with only $5 \sim 10$ function evaluations.

Title: Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment

Authors: Yuhui Ding, Thomas Hofmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10186
Pdf URL: https://arxiv.org/pdf/2506.10186
Copy Paste: [[2506.10186]] Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment(https://arxiv.org/abs/2506.10186)
Keywords: diffusion
Abstract: Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate Euclidean symmetries of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, specialized equivariant architectures limit the scalability and efficiency of diffusion models. In this paper, we propose an approach that relaxes such equivariance constraints. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained over the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency. Our code is available at this https URL

Title: Guardians of the Regime: When and Why Autocrats Create Secret Police

Authors: Marius Mehrl, Mila Pfander, Theresa Winner, Cornelius Fritz
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2506.10194
Pdf URL: https://arxiv.org/pdf/2506.10194
Copy Paste: [[2506.10194]] Guardians of the Regime: When and Why Autocrats Create Secret Police(https://arxiv.org/abs/2506.10194)
Keywords: security
Abstract: Autocrats use secret police to stay in power, as these organizations deter and suppress opposition to their rule. Existing research shows that secret police are very good at this but, surprisingly, also that they are not as ubiquitous in autocracies as one may assume, existing in less than 50% of autocratic country-years. We thus explore under which conditions secret police emerge in dictatorships. For this purpose, we apply statistical variable selection techniques to identify which of several candidate variables extracted from the literature on state security forces and authoritarian survival hold explanatory power. Our results highlight that secret police are more likely to emerge when rulers face specific, preempt-able threats, such as protests and anti-system mobilisation, but also when they have the material resources to establish these organisations. This research contributes to our understanding of autocrats' institutional choices and authoritarian politics.

Title: DynaSubVAE: Adaptive Subgrouping for Scalable and Robust OOD Detection

Authors: Tina Behrouzi, Sana Tonekaboni, Rahul G. Krishnan, Anna Goldenberg
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10200
Pdf URL: https://arxiv.org/pdf/2506.10200
Copy Paste: [[2506.10200]] DynaSubVAE: Adaptive Subgrouping for Scalable and Robust OOD Detection(https://arxiv.org/abs/2506.10200)
Keywords: robust
Abstract: Real-world observational data often contain existing or emerging heterogeneous subpopulations that deviate from global patterns. The majority of models tend to overlook these underrepresented groups, leading to inaccurate or even harmful predictions. Existing solutions often rely on detecting these samples as Out-of-domain (OOD) rather than adapting the model to new emerging patterns. We introduce DynaSubVAE, a Dynamic Subgrouping Variational Autoencoder framework that jointly performs representation learning and adaptive OOD detection. Unlike conventional approaches, DynaSubVAE evolves with the data by dynamically updating its latent structure to capture new trends. It leverages a novel non-parametric clustering mechanism, inspired by Gaussian Mixture Models, to discover and model latent subgroups based on embedding similarity. Extensive experiments show that DynaSubVAE achieves competitive performance in both near-OOD and far-OOD detection, and excels in class-OOD scenarios where an entire class is missing during training. We further illustrate that our dynamic subgrouping mechanism outperforms standalone clustering methods such as GMM and KMeans++ in terms of both OOD accuracy and regret precision.

Title: AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent

Authors: Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10205
Pdf URL: https://arxiv.org/pdf/2506.10205
Copy Paste: [[2506.10205]] AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent(https://arxiv.org/abs/2506.10205)
Keywords: large language model
Abstract: To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.
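
The connection between activation-aware pruning and Iterative Hard Thresholding can be sketched as projected gradient descent on a layer-wise reconstruction loss. The code below is a minimal illustration under stated assumptions, not the authors' AWP implementation: the objective $\tfrac{1}{2}\|(\hat{W} - W)X\|_F^2$, the per-row sparsity budget `k`, the step size $1/\lambda_{\max}(XX^\top)$, and all function names are hypothetical choices, and quantization is not covered.

```python
import numpy as np

def hard_threshold_rows(w, k):
    """Project onto row-wise k-sparse matrices: keep the k largest-magnitude
    entries of each row and zero the rest (requires 1 <= k <= w.shape[1])."""
    out = np.zeros_like(w)
    idx = np.argpartition(np.abs(w), -k, axis=1)[:, -k:]
    rows = np.arange(w.shape[0])[:, None]
    out[rows, idx] = w[rows, idx]
    return out

def prune_pgd(w, x, k, steps=100):
    """Iterative hard thresholding for 0.5 * ||(w_hat - w) @ x||_F^2.

    w: dense weights, shape (d_out, d_in)
    x: calibration activations, shape (d_in, n_samples), assumed non-zero
    """
    c = x @ x.T                           # activation Gram matrix (d_in, d_in)
    lr = 1.0 / np.linalg.eigvalsh(c)[-1]  # 1 / largest eigenvalue of c
    w_hat = hard_threshold_rows(w, k)     # start from magnitude pruning
    for _ in range(steps):
        grad = (w_hat - w) @ c            # gradient of the loss w.r.t. w_hat
        w_hat = hard_threshold_rows(w_hat - lr * grad, k)
    return w_hat
```

With this step size the reconstruction error never increases from one iteration to the next, so the result is at least as good as the magnitude-pruning starting point on the calibration activations.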

Title: Cross-Learning Between ECG and PCG: Exploring Common and Exclusive Characteristics of Bimodal Electromechanical Cardiac Waveforms

Authors: Sajjad Karimi, Amit J. Shah, Gari D. Clifford, Reza Sameni
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2506.10212
Pdf URL: https://arxiv.org/pdf/2506.10212
Copy Paste: [[2506.10212]] Cross-Learning Between ECG and PCG: Exploring Common and Exclusive Characteristics of Bimodal Electromechanical Cardiac Waveforms(https://arxiv.org/abs/2506.10212)
Keywords: extraction
Abstract: Simultaneous electrocardiography (ECG) and phonocardiogram (PCG) provide a comprehensive, multimodal perspective on cardiac function by capturing the heart's electrical and mechanical activities, respectively. However, the distinct and overlapping information content of these signals, as well as their potential for mutual reconstruction and biomarker extraction, remains incompletely understood, especially under varying physiological conditions and across individuals. In this study, we systematically investigate the common and exclusive characteristics of ECG and PCG using the EPHNOGRAM dataset of simultaneous ECG-PCG recordings during rest and exercise. We employ a suite of linear and nonlinear machine learning models, including non-causal LSTM networks, to reconstruct each modality from the other and analyze the influence of causality, physiological state, and cross-subject variability. Our results demonstrate that nonlinear models, particularly non-causal LSTM, provide superior reconstruction performance, with reconstructing ECG from PCG proving more tractable than the reverse. Exercise and cross-subject scenarios present significant challenges, but envelope-based modeling that utilizes instantaneous amplitude features substantially improves cross-subject generalizability for cross-modal learning. Furthermore, we demonstrate that clinically relevant ECG biomarkers, such as fiducial points and QT intervals, can be estimated from PCG in cross-subject settings. These findings advance our understanding of the relationship between electromechanical cardiac modalities, in terms of both waveform characteristics and the timing of cardiac events, with potential applications in novel multimodal cardiac monitoring technologies.

Title: ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators

Authors: Parsa Rahimi, Sebastien Marcel
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10226
Pdf URL: https://arxiv.org/pdf/2506.10226
Copy Paste: [[2506.10226]] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators(https://arxiv.org/abs/2506.10226)
Keywords: diffusion
Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator's embedding space, rather than close in the generator's condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator's learned condition space and the discriminator's embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: this https URL
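
The convex mixing of class-conditioned scores can be illustrated with a toy example. In the sketch below, analytic scores of unit-variance Gaussians stand in for a trained diffusion model's class-conditioned score network; the helper names and the mixing weight `lam` are assumptions made for illustration, and the sampling loop used in the paper is omitted.

```python
import numpy as np

def gaussian_score(x, mean):
    """Score (gradient of the log-density) of N(mean, I) evaluated at x.
    Toy stand-in for a class-conditioned score network."""
    return -(x - mean)

def mix_scores(score_a, score_b, lam):
    """Convex combination of two class-conditioned scores, 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * score_a + (1.0 - lam) * score_b
```

In this toy case the mixed score equals the score of a Gaussian centered at `lam * mean_a + (1 - lam) * mean_b`, so following it yields samples between the two classes, which is the intuition behind using mixed scores to synthesize challenging training examples.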

Title: California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops

Authors: Hamid Kamangir, Mona Hajiesmaeeli, Mason Earles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10228
Pdf URL: https://arxiv.org/pdf/2506.10228
Copy Paste: [[2506.10228]] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops(https://arxiv.org/abs/2506.10228)
Keywords: extraction
Abstract: California is a global leader in agricultural production, contributing 12.5% of the United States total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a timeseries encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall $R^2$ score of 0.76 across all crops on an unseen test dataset, highlighting strong predictive performance across California's diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.

Title: Classifying Unreliable Narrators with Large Language Models

Authors: Anneliese Brei, Katharine Henry, Abhisheik Sharma, Shashank Srivastava, Snigdha Chaturvedi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10231
Pdf URL: https://arxiv.org/pdf/2506.10231
Copy Paste: [[2506.10231]] Classifying Unreliable Narrators with Large Language Models(https://arxiv.org/abs/2506.10231)
Keywords: large language model
Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.

Title: LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation

Authors: Chen-Chia Chang, Wan-Hsuan Lin, Yikang Shen, Yiran Chen, Xin Zhang
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2506.10235
Pdf URL: https://arxiv.org/pdf/2506.10235
Copy Paste: [[2506.10235]] LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation(https://arxiv.org/abs/2506.10235)
Keywords: robust
Abstract: Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.

Title: Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Authors: Yeonwoo Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz
Subjects: cs.CR, cs.AI, cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10236
Pdf URL: https://arxiv.org/pdf/2506.10236
Copy Paste: [[2506.10236]] Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods(https://arxiv.org/abs/2506.10236)
Keywords: attack, robust
Abstract: In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, and employ output-based, logit-based, and probe analysis to determine to what extent supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., Hindi filler text in the original prompt recovering 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by modifying the way the answer is formatted, as the correlation between output and logit accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between true knowledge removal and superficial output suppression. We also make our evaluation framework publicly available so that prompting techniques for retrieving unlearned knowledge can be evaluated easily.

Title: A new type of federated clustering: A non-model-sharing approach

Authors: Yuji Kawamata, Kaoru Kamijo, Maki Kihira, Akihiro Toyoda, Tomoru Nakayama, Akira Imakura, Tetsuya Sakurai, Yukihiko Okada
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10244
Pdf URL: https://arxiv.org/pdf/2506.10244
Copy Paste: [[2506.10244]] A new type of federated clustering: A non-model-sharing approach(https://arxiv.org/abs/2506.10244)
Keywords: privacy, federate
Abstract: In recent years, the growing need to leverage sensitive data across institutions has led to increased attention on federated learning (FL), a decentralized machine learning paradigm that enables model training without sharing raw data. However, existing FL-based clustering methods, known as federated clustering, typically assume simple data partitioning scenarios such as horizontal or vertical splits, and cannot handle more complex distributed structures. This study proposes data collaboration clustering (DC-Clustering), a novel federated clustering method that supports clustering over complex data partitioning scenarios where horizontal and vertical splits coexist. In DC-Clustering, each institution shares only intermediate representations instead of raw data, ensuring privacy preservation while enabling collaborative clustering. The method allows flexible selection between k-means and spectral clustering, and achieves final results with a single round of communication with the central server. We conducted extensive experiments using synthetic and open benchmark datasets. The results show that our method achieves clustering performance comparable to centralized clustering where all data are pooled. DC-Clustering addresses an important gap in current FL research by enabling effective knowledge discovery from distributed heterogeneous data. Its practical properties -- privacy preservation, communication efficiency, and flexibility -- make it a promising tool for privacy-sensitive domains such as healthcare and finance.

Title: ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese

Authors: Iago Alves Brito, Julia Soares Dollis, Fernanda Bufon Färber, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10245
Pdf URL: https://arxiv.org/pdf/2506.10245
Copy Paste: [[2506.10245]] ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese(https://arxiv.org/abs/2506.10245)
Keywords: protect, robust
Abstract: We present ToxSyn-PT, the first large-scale Portuguese corpus that enables fine-grained hate-speech classification across nine legally protected minority groups. The dataset contains 53,274 synthetic sentences equally distributed between minority groups and toxicity labels. ToxSyn-PT is created through a novel four-stage pipeline: (1) a compact, manually curated seed; (2) few-shot expansion with an instruction-tuned LLM; (3) paraphrase-based augmentation; and (4) enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The resulting corpus is class-balanced, stylistically diverse, and free from the social-media domain that dominates existing Portuguese datasets. Despite domain differences with traditional benchmarks, experiments on both binary and multi-label classification on the corpus yield strong results across five public Portuguese hate-speech datasets, demonstrating robust generalization even across domain boundaries. The dataset is publicly released to advance research on synthetic data and hate-speech detection in low-resource settings.

Title: Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models

Authors: Andrea Yaoyun Cui, Pengfei Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10268
Pdf URL: https://arxiv.org/pdf/2506.10268
Copy Paste: [[2506.10268]] Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models(https://arxiv.org/abs/2506.10268)
Keywords: large language model
Abstract: Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a "false prior." To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights into the decision-making of large language models.

Title: Interior-Point Vanishing Problem in Semidefinite Relaxations for Neural Network Verification

Authors: Ryota Ueda, Takami Sato, Ken Kobayashi, Kazuhide Nakata
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2506.10269
Pdf URL: https://arxiv.org/pdf/2506.10269
Copy Paste: [[2506.10269]] Interior-Point Vanishing Problem in Semidefinite Relaxations for Neural Network Verification(https://arxiv.org/abs/2506.10269)
Keywords: secure
Abstract: Semidefinite programming (SDP) relaxation has emerged as a promising approach for neural network verification, offering tighter bounds than other convex relaxation methods for deep neural networks (DNNs) with ReLU activations. However, we identify a critical limitation in the SDP relaxation when applied to deep networks: interior-point vanishing, which leads to the loss of strict feasibility -- a crucial condition for the numerical stability and optimality of SDP. Through rigorous theoretical and empirical analysis, we demonstrate that as the depth of DNNs increases, the strict feasibility is likely to be lost, creating a fundamental barrier to scaling SDP-based verification. To address the interior-point vanishing, we design and investigate five solutions to enhance the feasibility conditions of the verification problem. Our methods can successfully solve 88% of the problems that could not be solved by existing methods, accounting for 41% of the total. Our analysis also reveals that the valid constraints for the lower and upper bounds for each ReLU unit have traditionally been inherited from prior work without solid justification, yet are not only unhelpful but actually harmful to the problem's feasibility. This work provides valuable insights into the fundamental challenges of SDP-based DNN verification and offers practical solutions to improve its applicability to deeper neural networks, contributing to the development of more reliable and secure systems with DNNs.

Title: Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning

Authors: Jiajin Liu, Dongzhe Fan, Jiacheng Shen, Chuanhao Ji, Daochen Zha, Qiaoyu Tan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10282
Pdf URL: https://arxiv.org/pdf/2506.10282
Copy Paste: [[2506.10282]] Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning(https://arxiv.org/abs/2506.10282)
Keywords: fair, large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems. Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs: Encoder, Aligner, and Predictor. MLLM-as-Encoder focuses on enhancing graph neural networks (GNNs) via multimodal feature fusion; MLLM-as-Aligner aligns multimodal attributes in language or hidden space to enable LLM-based graph reasoning; MLLM-as-Predictor treats MLLMs as standalone reasoners with in-context learning or fine-tuning. Despite their advances, the MMG field lacks a unified benchmark to fairly evaluate across these approaches, making it unclear what progress has been made. To bridge this gap, we present Graph-MLLM, a comprehensive benchmark for multimodal graph learning by systematically evaluating these three paradigms across six datasets with different domains. Through extensive experiments, we observe that jointly considering the visual and textual attributes of the nodes benefits graph learning, even when using pre-trained text-to-image alignment models (e.g., CLIP) as encoders. We also find that converting visual attributes into textual descriptions further improves performance compared to directly using visual inputs. Moreover, we observe that fine-tuning MLLMs on specific MMGs can achieve state-of-the-art results in most scenarios, even without explicit graph structure information. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field.

Title: HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Authors: Eunkyu Park, Minyeong Kim, Gunhee Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10286
Pdf URL: https://arxiv.org/pdf/2506.10286
Copy Paste: [[2506.10286]] HalLoc: Token-level Localization of Hallucinations for Vision Language Models(https://arxiv.org/abs/2506.10286)
Keywords: robust
Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: this https URL.

Title: ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs

Authors: Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10288
Pdf URL: https://arxiv.org/pdf/2506.10288
Copy Paste: [[2506.10288]] ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs(https://arxiv.org/abs/2506.10288)
Keywords: large language model
Abstract: Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.
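
The budget-allocation idea in the abstract above (clusters as bandit arms, a cold start, then UCB-guided sampling) can be illustrated with a minimal sketch. This uses the textbook UCB1 rule, not the paper's modified UCB, and it assumes each sample already has a precomputed influence score; the names `ucb_cluster_selection`, `cluster_scores`, and the exploration constant `c` are hypothetical.

```python
import math
import random


def ucb_cluster_selection(cluster_scores, budget, c=1.0, seed=0):
    """Spend a sampling budget across clusters with a plain UCB1 rule.

    cluster_scores[k][i] is an assumed, precomputed influence score for
    sample i of cluster k (in practice this would be the costly gradient
    computation). Returns (drawn pairs, per-cluster means, per-cluster counts).
    """
    rng = random.Random(seed)
    num_clusters = len(cluster_scores)
    # Shuffle the sample order inside each cluster; draw without replacement.
    remaining = [list(range(len(scores))) for scores in cluster_scores]
    for pool in remaining:
        rng.shuffle(pool)
    counts = [0] * num_clusters
    means = [0.0] * num_clusters
    drawn = []

    def pull(k):
        i = remaining[k].pop()
        reward = cluster_scores[k][i]
        counts[k] += 1
        means[k] += (reward - means[k]) / counts[k]  # running mean
        drawn.append((k, i))

    # Cold start: one draw from every non-empty cluster.
    for k in range(num_clusters):
        if remaining[k] and len(drawn) < budget:
            pull(k)

    # Main loop: pick the cluster with the highest upper confidence bound.
    while len(drawn) < budget:
        candidates = [k for k in range(num_clusters) if remaining[k]]
        if not candidates:
            break
        total = sum(counts)
        best = max(
            candidates,
            key=lambda k: means[k] + c * math.sqrt(2 * math.log(total) / counts[k]),
        )
        pull(best)
    return drawn, means, counts
```

With three clusters whose scores are constant at 0.9, 0.1, and 0.5, most of a 15-sample budget goes to the first cluster, while the others are still revisited occasionally through the exploration bonus.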

Title: Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Authors: Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10292
Pdf URL: https://arxiv.org/pdf/2506.10292
Copy Paste: [[2506.10292]] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages(https://arxiv.org/abs/2506.10292)
Keywords: robust
Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick's efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.

Title: "Check My Work?": Measuring Sycophancy in a Simulated Educational Context

Authors: Chuck Arvin
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.10297
Pdf URL: https://arxiv.org/pdf/2506.10297
Copy Paste: [[2506.10297]] "Check My Work?": Measuring Sycophancy in a Simulated Educational Context(https://arxiv.org/abs/2506.10297)
Keywords: large language model
Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs "flip" their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.

Title: Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.10299
Pdf URL: https://arxiv.org/pdf/2506.10299
Copy Paste: [[2506.10299]] Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs(https://arxiv.org/abs/2506.10299)
Keywords: large language model
Abstract: Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which makes it challenging to adapt them to the speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech-text training in this study. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves translation performance, especially for languages with limited training data.
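
A rough sketch of what a decaying text ratio with word-level interleaving might look like is given below. The abstract does not state the schedule shape or the exact interleaving rule, so the linear decay, the per-word substitution of a text token for that word's speech units, and the names `text_ratio` and `interleave` are all assumptions made for illustration.

```python
import random


def text_ratio(step, total_steps, start=0.9, end=0.0):
    """Linearly decay the text ratio from `start` to `end` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def interleave(words, ratio, rng):
    """Build one training sequence from word-aligned (text, speech units) pairs.

    Each word is emitted as its text token with probability `ratio`,
    otherwise as its discrete speech units (assumed interleaving rule).
    """
    sequence = []
    for text_token, speech_units in words:
        if rng.random() < ratio:
            sequence.append(text_token)
        else:
            sequence.extend(speech_units)
    return sequence


# Hypothetical word-aligned example: text token plus discrete unit IDs.
aligned = [("hello", ["u12", "u7"]), ("world", ["u3", "u3", "u41"])]
rng = random.Random(0)
early = interleave(aligned, text_ratio(0, 1000), rng)    # mostly text
late = interleave(aligned, text_ratio(1000, 1000), rng)  # speech units only
```

At the end of the schedule the ratio reaches 0.0, so the sequence contains only speech units, which matches the target format used at inference.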

Title: Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation

Authors: Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10302
Pdf URL: https://arxiv.org/pdf/2506.10302
Copy Paste: [[2506.10302]] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation(https://arxiv.org/abs/2506.10302)
Keywords: transformer
Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors, including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions. This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
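
To make the uncertainty-aware evaluation more concrete, the sketch below computes the predictive entropy of the mean class probabilities over several stochastic passes (as in Monte Carlo Dropout or an ensemble), and then scores predictions on a correct/incorrect versus certain/uncertain grid. The abstract does not define UAcc, USen, USpe, or UPre; the formulas here follow a commonly used confusion-matrix formulation and should be read as an assumption, as should the function names.

```python
import math


def predictive_entropy(prob_samples):
    """Entropy of the mean class probabilities over T stochastic passes.

    prob_samples: list of T probability vectors for one input.
    Returns (entropy in nats, mean probability vector).
    """
    num_classes = len(prob_samples[0])
    mean = [
        sum(p[c] for p in prob_samples) / len(prob_samples)
        for c in range(num_classes)
    ]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return entropy, mean


def uncertainty_metrics(correct, uncertain):
    """Assumed definitions: an incorrect and uncertain prediction is a
    'true uncertain' (TU), a correct and certain one a 'true certain' (TC)."""
    pairs = list(zip(correct, uncertain))
    tu = sum(1 for c, u in pairs if not c and u)
    tc = sum(1 for c, u in pairs if c and not u)
    fu = sum(1 for c, u in pairs if c and u)
    fc = sum(1 for c, u in pairs if not c and not u)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "UAcc": ratio(tu + tc, tu + tc + fu + fc),
        "USen": ratio(tu, tu + fc),
        "USpe": ratio(tc, tc + fu),
        "UPre": ratio(tu, tu + fu),
    }
```

A prediction would typically be flagged as uncertain when its entropy exceeds a chosen threshold; that threshold is a tuning choice and is not specified in the abstract.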

Title: ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

Authors: Chuyang Chen, Brendan Dolan-Gavitt, Zhiqiang Lin
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.10323
Pdf URL: https://arxiv.org/pdf/2506.10323
Copy Paste: [[2506.10323]] ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space(https://arxiv.org/abs/2506.10323)
Keywords: large language model
Abstract: Generation-based fuzzing produces appropriate test cases according to specifications of input grammars and semantic constraints to test systems and software. However, these specifications require significant manual effort to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance. Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes -- up to 1,791,104 lines of code in our evaluation -- and 2) synthesize efficient fuzzers that catch interesting grammatical structures and semantic constraints in a human-understandable way. Our evaluation compared ELFuzz with specifications manually written by domain experts and synthesized by state-of-the-art approaches. It shows that ELFuzz achieves up to 434.8% more coverage and triggers up to 174.0% more artificially injected bugs. We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three are exploitable). Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz. Further analysis of the fuzzers synthesized by ELFuzz confirms that they catch interesting grammatical structures and semantic constraints in a human-understandable way. The results demonstrate the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing.

Title: A Comprehensive Survey of Unmanned Aerial Systems' Risks and Mitigation Strategies

Authors: Sharad Shrestha, Mohammed Ababneh, Satyajayant Misra, Henry M. Cathey Jr., Roopa Vishwanathan, Matt Jansen, Jinhong Choi, Rakesh Bobba, Yeongjin Jang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10327
Pdf URL: https://arxiv.org/pdf/2506.10327
Copy Paste: [[2506.10327]] A Comprehensive Survey of Unmanned Aerial Systems' Risks and Mitigation Strategies(https://arxiv.org/abs/2506.10327)
Keywords: security, defense, attack
Abstract: In the last decade, the use of Unmanned Aircraft Systems (UAS) and Unmanned Aircraft Vehicles (UAV) in communication, defense, and transportation has grown rapidly, and the application of UAS will continue to increase. This has led researchers to examine security vulnerabilities in various facets of UAS infrastructure and in UAVs, which form a part of the UAS, in order to reinforce these critical systems. This survey summarizes the cybersecurity vulnerabilities in several phases of UAV deployment, the likelihood of each vulnerability's occurrence, the impact of attacks, and mitigation strategies that could be applied. We go beyond the state-of-the-art by taking a comprehensive approach to enhancing UAS security, analyzing both UAS-specific and non-UAS-specific mitigation strategies that are applicable within the UAS domain to define the lessons learned. We also present relevant cybersecurity standards and their recommendations in the UAS context. Despite the significant literature in UAS security and the relevance of cyberphysical and networked systems security approaches from the past, which we identify in the survey, we find several critical research gaps that require further investigation. These form part of our discussions and recommendations for future exploration by our research community.

Title: Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video

Authors: Fei Zhao, Da Pan, Zelu Qi, Ping Shi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.10331
Pdf URL: https://arxiv.org/pdf/2506.10331
Copy Paste: [[2506.10331]] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video(https://arxiv.org/abs/2506.10331)
Keywords: extraction
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, consisting of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.

Title: GeoCAD: Local Geometry-Controllable CAD Generation

Authors: Zhanwei Zhang, Kaiyuan Liu, Junjie Liu, Wenxiao Wang, Binbin Lin, Liang Xie, Chen Shen, Deng Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10337
Pdf URL: https://arxiv.org/pdf/2506.10337
Copy Paste: [[2506.10337]] GeoCAD: Local Geometry-Controllable CAD Generation(https://arxiv.org/abs/2506.10337)
Keywords: large language model
Abstract: Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption $\sim$221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at this https URL.

Title: Adaptive Chosen-Ciphertext Security of Distributed Broadcast Encryption

Authors: Kwangsu Lee
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10338
Pdf URL: https://arxiv.org/pdf/2506.10338
Copy Paste: [[2506.10338]] Adaptive Chosen-Ciphertext Security of Distributed Broadcast Encryption(https://arxiv.org/abs/2506.10338)
Keywords: secure, security, attack
Abstract: Distributed broadcast encryption (DBE) is a specific kind of broadcast encryption (BE) where users independently generate their own public and private keys, and a sender can efficiently create a ciphertext for a subset of users by using the public keys of the users in that subset. Previously proposed DBE schemes have been proven secure in the adaptive chosen-plaintext attack (CPA) security model and have the disadvantage of requiring a linear number of pairing operations when verifying the public key of a user. In this paper, we propose an efficient DBE scheme in bilinear groups and prove adaptive chosen-ciphertext attack (CCA) security for the first time. To do this, we first propose a semi-static CCA secure DBE scheme and prove its security under the $q$-Type assumption. Then, by modifying the generic transformation of Gentry and Waters, which converts a semi-static CPA secure DBE scheme into an adaptive CPA secure DBE scheme, so that it applies to CCA secure DBE schemes, we propose an adaptive CCA secure DBE scheme and prove its adaptive CCA security. Our proposed DBE scheme is efficient because it requires constant size ciphertexts, constant size private keys, and linear size public keys, and the public key verification requires only a constant number of pairing operations and efficient group membership checks.

Title: Provably Learning from Language Feedback

Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10341
Pdf URL: https://arxiv.org/pdf/2506.10341
Copy Paste: [[2506.10341]] Provably Learning from Language Feedback(https://arxiv.org/abs/2506.10341)
Keywords: large language model
Abstract: Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

Title: UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models

Authors: Jun Yin, Jing Zhong, Peilin Li, Pengyu Zeng, Miao Zhang, Ran Luo, Shuai Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10342
Pdf URL: https://arxiv.org/pdf/2506.10342
Copy Paste: [[2506.10342]] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models(https://arxiv.org/abs/2506.10342)
Keywords: large language model
Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method's ability to capture subtle stylistic differences. These results highlight the method's potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.

Title: Code Execution as Grounded Supervision for LLM Reasoning

Authors: Dongwon Jung, Wenxuan Zhou, Muhao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10343
Pdf URL: https://arxiv.org/pdf/2506.10343
Copy Paste: [[2506.10343]] Code Execution as Grounded Supervision for LLM Reasoning(https://arxiv.org/abs/2506.10343)
Keywords: large language model
Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
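
The core idea above, turning a deterministic execution trace into step-by-step reasoning text, can be sketched with Python's built-in `sys.settrace`. This is only an illustration under simple assumptions: the helper names `trace_locals` and `to_reasoning`, the example function `total_cost`, and the sentence template are hypothetical, and the paper's actual extraction and rewriting pipeline is not described in the abstract.

```python
import sys

_MISSING = object()


def trace_locals(func, *args):
    """Run func(*args) and snapshot its local variables at every line
    event and at the return event. Snapshots are shallow copies, so
    in-place mutation of lists or dicts would not show up as a change."""
    snapshots = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None  # do not trace other functions
        if event in ("line", "return"):
            snapshots.append(dict(frame.f_locals))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, snapshots


def to_reasoning(snapshots, result):
    """Render variable changes between snapshots as short sentences."""
    sentences = []
    prev = {}
    for snap in snapshots:
        for name, value in snap.items():
            if prev.get(name, _MISSING) != value:
                sentences.append(f"{name} becomes {value!r}.")
        prev = snap
    sentences.append(f"So the answer is {result!r}.")
    return sentences


def total_cost(price, quantity):
    subtotal = price * quantity
    tax = subtotal // 10
    return subtotal + tax


answer, snaps = trace_locals(total_cost, 20, 3)
steps = to_reasoning(snaps, answer)
```

Because every sentence is derived from values the interpreter actually produced, the resulting trace cannot contain an arithmetic slip, which is the property the abstract refers to as the determinism of program execution.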

Title: PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Authors: Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10351
Pdf URL: https://arxiv.org/pdf/2506.10351
Copy Paste: [[2506.10351]] PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation(https://arxiv.org/abs/2506.10351)
Keywords: transformer
Abstract: Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating a pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications.

Title: History-Aware Neural Operator: Robust Data-Driven Constitutive Modeling of Path-Dependent Materials

Authors: Binyao Guo, Zihan Lin, QiZhi He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10352
Pdf URL: https://arxiv.org/pdf/2506.10352
Copy Paste: [[2506.10352]] History-Aware Neural Operator: Robust Data-Driven Constitutive Modeling of Path-Dependent Materials(https://arxiv.org/abs/2506.10352)
Keywords: robust, extraction
Abstract: This study presents an end-to-end learning framework for data-driven modeling of path-dependent inelastic materials using neural operators. The framework is built on the premise that irreversible evolution of material responses, governed by hidden dynamics, can be inferred from observable data. We develop the History-Aware Neural Operator (HANO), an autoregressive model that predicts path-dependent material responses from short segments of recent strain-stress history without relying on hidden state variables, thereby overcoming self-consistency issues commonly encountered in recurrent neural network (RNN)-based models. Built on a Fourier-based neural operator backbone, HANO enables discretization-invariant learning. To enhance its ability to capture both global loading patterns and critical local path dependencies, we embed a hierarchical self-attention mechanism that facilitates multiscale feature extraction. Beyond ensuring self-consistency, HANO mitigates sensitivity to initial hidden states, a commonly overlooked issue that can lead to instability in recurrent models when applied to generalized loading paths. By modeling stress-strain evolution as a continuous operator rather than relying on fixed input-output mappings, HANO naturally accommodates varying path discretizations and exhibits robust performance under complex conditions, including irregular sampling, multi-cycle loading, noisy data, and pre-stressed states. We evaluate HANO on two benchmark problems: elastoplasticity with hardening and progressive anisotropic damage in brittle solids. Results show that HANO consistently outperforms baseline models in predictive accuracy, generalization, and robustness. With its demonstrated capabilities, HANO provides an effective data-driven surrogate for simulating inelastic materials and is well-suited for integration with classical numerical solvers.
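
The autoregressive use of a short strain-stress history, with no hidden state carried between steps, can be sketched as a simple rollout loop. The model here is a toy linear-elastic stand-in, not a neural operator; `autoregressive_rollout`, `toy_elastic_model`, and the `modulus` value are hypothetical names and numbers chosen for illustration.

```python
def autoregressive_rollout(model, strains, initial_stresses, window):
    """Predict a stress path from a strain path, one step at a time.

    Each prediction sees only the last `window` (strain, stress) pairs
    plus the next strain, so no hidden state has to be initialized.
    """
    if len(initial_stresses) != window:
        raise ValueError("need exactly `window` known initial stresses")
    stresses = list(initial_stresses)
    for t in range(window, len(strains)):
        history = list(zip(strains[t - window:t], stresses[t - window:t]))
        stresses.append(model(history, strains[t]))
    return stresses


def toy_elastic_model(history, next_strain, modulus=2.0):
    """Stand-in for a learned operator: incremental linear elasticity."""
    last_strain, last_stress = history[-1]
    return last_stress + modulus * (next_strain - last_strain)


path = autoregressive_rollout(
    toy_elastic_model,
    strains=[0.0, 0.1, 0.2, 0.3, 0.4],
    initial_stresses=[0.0, 0.2, 0.4],
    window=3,
)
```

Because predicted stresses are fed back in as history, errors can accumulate over long paths; how the actual model controls that is not covered by this sketch.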

Title: Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation

Authors: Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10353
Pdf URL: https://arxiv.org/pdf/2506.10353
Copy Paste: [[2506.10353]] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation(https://arxiv.org/abs/2506.10353)
Keywords: large language model
Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
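
The group-relative step of Group Relative Policy Optimization, named in the abstract above, can be sketched as follows. This is a minimal illustration and not the authors' code: the reward values are hypothetical stand-ins for motion quality feedback, and using the population standard deviation with a small `eps` is an assumption, since implementations differ on this detail.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is the reward's deviation from the group mean, scaled by
    the group's standard deviation (population std is an assumption here).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical quality scores for four motions sampled from one instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.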

Title: TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree

Authors: Yu-Yang Qian, Yuan-Ze Xu, Zhen-Yu Zhang, Peng Zhao, Zhi-Hua Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10355
Pdf URL: https://arxiv.org/pdf/2506.10355
Copy Paste: [[2506.10355]] TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree(https://arxiv.org/abs/2506.10355)
Keywords: transformer, large language model
Abstract: Many real-world applications collect data in a streaming environment, where learning tasks are encountered sequentially. This necessitates continual learning (CL) to update models online, enabling adaptation to new tasks while preserving past knowledge to prevent catastrophic forgetting. With the rise of large pre-trained models (LPMs), efficiency has become increasingly critical for CL, due to their substantial computational demands and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of Low-Rank Adapters), a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity to enable efficient CL, particularly for LPMs. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds to efficiently explore the task structure. Furthermore, we use sparse gradient updates to facilitate parameter optimization, making the approach better suited for LPMs. Theoretical analysis is provided to justify the rationale behind our approach, and experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach across various domains, including vision and natural language processing tasks.

Title: FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device

Authors: Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10361
Pdf URL: https://arxiv.org/pdf/2506.10361
Copy Paste: [[2506.10361]] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device(https://arxiv.org/abs/2506.10361)
Keywords: transformer
Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6x faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2x faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.

Title: Can We Infer Confidential Properties of Training Data from LLMs?

Authors: Penguin Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2506.10364
Pdf URL: https://arxiv.org/pdf/2506.10364
Copy Paste: [[2506.10364]] Can We Infer Confidential Properties of Training Data from LLMs?(https://arxiv.org/abs/2506.10364)
Keywords: attack, generative, large language model
Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

Title: FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion

Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10366
Pdf URL: https://arxiv.org/pdf/2506.10366
Copy Paste: [[2506.10366]] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion(https://arxiv.org/abs/2506.10366)
Keywords: transformer
Abstract: Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at this https URL.

Title: Revisiting Transformers with Insights from Image Filtering

Authors: Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10371
Pdf URL: https://arxiv.org/pdf/2506.10371
Copy Paste: [[2506.10371]] Revisiting Transformers with Insights from Image Filtering(https://arxiv.org/abs/2506.10371)
Keywords: robust, interpretability, transformer
Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.

Title: EQA-RM: A Generative Embodied Reward Model with Test-time Scaling

Authors: Yuhang Chen, Zhen Tan, Tianlong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10389
Pdf URL: https://arxiv.org/pdf/2506.10389
Copy Paste: [[2506.10389]] EQA-RM: A Generative Embodied Reward Model with Test-time Scaling(https://arxiv.org/abs/2506.10389)
Keywords: generative
Abstract: Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents' spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (fine-tuning Qwen2-VL-2B-Instruct) achieves 61.9% accuracy on EQA-RM-Bench with only 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM. The code and dataset can be found at this https URL.

Title: DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba

Authors: Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10390
Pdf URL: https://arxiv.org/pdf/2506.10390
Copy Paste: [[2506.10390]] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba(https://arxiv.org/abs/2506.10390)
Keywords: transformer
Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at this https URL.

Title: ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion

Authors: Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang, Lei Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10391
Pdf URL: https://arxiv.org/pdf/2506.10391
Copy Paste: [[2506.10391]] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion(https://arxiv.org/abs/2506.10391)
Keywords: robust, diffusion
Abstract: Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the increasing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based sea surface temperature (SST) reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) is 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at this https URL.

Title: Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10395
Pdf URL: https://arxiv.org/pdf/2506.10395
Copy Paste: [[2506.10395]] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation(https://arxiv.org/abs/2506.10395)
Keywords: robust, generative, large language model
Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.

Title: FicGCN: Unveiling the Homomorphic Encryption Efficiency from Irregular Graph Convolutional Networks

Authors: Zhaoxuan Kan, Husheng Han, Shangyi Shi, Tenghui Hua, Hang Lu, Xiaowei Li, Jianan Mu, Xing Hu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10399
Pdf URL: https://arxiv.org/pdf/2506.10399
Copy Paste: [[2506.10399]] FicGCN: Unveiling the Homomorphic Encryption Efficiency from Irregular Graph Convolutional Networks(https://arxiv.org/abs/2506.10399)
Keywords: privacy
Abstract: Graph Convolutional Neural Networks (GCNs) have gained widespread popularity in various fields like personal healthcare and financial systems, due to their remarkable performance. Despite the growing demand for cloud-based GCN services, privacy concerns over sensitive graph data remain significant. Homomorphic Encryption (HE) facilitates Privacy-Preserving Machine Learning (PPML) by allowing computations to be performed on encrypted data. However, HE introduces substantial computational overhead, particularly for GCN operations that require rotations and multiplications in matrix products. The sparsity of GCNs offers significant performance potential, but their irregularity introduces additional operations that reduce practical gains. In this paper, we propose FicGCN, an HE-based framework specifically designed to harness the sparse characteristics of GCNs and strike a globally optimal balance between aggregation and combination operations. FicGCN employs a latency-aware packing scheme, a Sparse Intra-Ciphertext Aggregation (SpIntra-CA) method to minimize rotation overhead, and a region-based data reordering driven by local adjacency structure. We evaluated FicGCN on several popular datasets, and the results show that FicGCN achieved the best performance across all tested datasets, with up to a 4.10x improvement over the latest design.

Title: Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation

Authors: Tzu-Heng Huang, Harit Vishwakarma, Frederic Sala
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10403
Pdf URL: https://arxiv.org/pdf/2506.10403
Copy Paste: [[2506.10403]] Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation(https://arxiv.org/abs/2506.10403)
Keywords: large language model
Abstract: Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.

Title: Generative Algorithms for Wildfire Progression Reconstruction from Multi-Modal Satellite Active Fire Measurements and Terrain Height

Authors: Bryan Shaddy, Brianna Binder, Agnimitra Dasgupta, Haitong Qin, James Haley, Angel Farguell, Kyle Hilburn, Derek V. Mallia, Adam Kochanski, Jan Mandel, Assad Oberai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10404
Pdf URL: https://arxiv.org/pdf/2506.10404
Copy Paste: [[2506.10404]] Generative Algorithms for Wildfire Progression Reconstruction from Multi-Modal Satellite Active Fire Measurements and Terrain Height(https://arxiv.org/abs/2506.10404)
Keywords: generative
Abstract: Increasing wildfire occurrence has spurred growing interest in wildfire spread prediction. However, even the most complex wildfire models diverge from observed progression during multi-day simulations, motivating the need for data assimilation. A useful approach to assimilating measurement data into complex coupled atmosphere-wildfire models is to estimate wildfire progression from measurements and use this progression to develop a matching atmospheric state. In this study, an approach is developed for estimating fire progression from VIIRS active fire measurements, GOES-derived ignition times, and terrain height data. A conditional Generative Adversarial Network is trained with simulations of historic wildfires from the atmosphere-wildfire model WRF-SFIRE, thus allowing incorporation of WRF-SFIRE physics into estimates. Fire progression is succinctly represented by fire arrival time, and measurements for training are obtained by applying an approximate observation operator to WRF-SFIRE solutions, eliminating the need for satellite data during training. The model is trained on tuples of fire arrival times, measurements, and terrain, and once trained leverages measurements of real fires and corresponding terrain data to generate samples of fire arrival times. The approach is validated on five Pacific US wildfires, with results compared against high-resolution perimeters measured via aircraft, finding an average Sorensen-Dice coefficient of 0.81. The influence of terrain height on the arrival time inference is also evaluated, and it is observed that terrain has minimal influence when the inference is conditioned on satellite measurements.
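
For readers unfamiliar with the validation metric named above, the Sorensen-Dice coefficient between two regions A and B is 2|A ∩ B| / (|A| + |B|). The sketch below is illustrative only: representing burned areas as sets of grid cells, and returning 1.0 when both sets are empty, are assumptions made for this example and are not taken from the paper.

```python
def dice_coefficient(predicted, observed):
    """Sorensen-Dice overlap between two collections of burned grid cells."""
    predicted, observed = set(predicted), set(observed)
    if not predicted and not observed:
        return 1.0  # convention assumed here: two empty regions agree fully
    overlap = len(predicted & observed)
    return 2.0 * overlap / (len(predicted) + len(observed))

# Hypothetical (row, col) cells inside a predicted and an observed perimeter.
predicted_cells = {(0, 0), (0, 1), (1, 0)}
observed_cells = {(0, 1), (1, 0), (1, 1)}
score = dice_coefficient(predicted_cells, observed_cells)
```

The coefficient ranges from 0 (no shared cells) to 1 (identical regions), so the reported average of 0.81 indicates substantial but not complete overlap.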

Title: PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Authors: Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10406
Pdf URL: https://arxiv.org/pdf/2506.10406
Copy Paste: [[2506.10406]] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier(https://arxiv.org/abs/2506.10406)
Keywords: generative, large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
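
The selective revision idea described above (revise only when the verification step flags an error) can be illustrated by the control flow below. It is a rough sketch of the inference-time loop only and says nothing about the reinforcement learning training; `generate`, `verify`, `revise`, and `max_turns` are hypothetical names introduced for this example, with all three roles assumed to be played by the same model.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Return an answer, revising only when verification reports an error.

    generate(question) -> answer
    verify(question, answer) -> True if the answer is judged correct
    revise(question, answer) -> new answer
    """
    answer = generate(question)
    for _ in range(max_turns - 1):
        if verify(question, answer):
            break  # verifier accepts: no second attempt is produced
        answer = revise(question, answer)
    return answer

# Toy stand-ins for the model's policy and verifier roles.
calls = {"revise": 0}

def toy_generate(q):
    return 5 if q == "2+2" else 6

def toy_verify(q, a):
    return a == {"2+2": 4, "3+3": 6}[q]

def toy_revise(q, a):
    calls["revise"] += 1
    return a - 1

fixed = verify_then_revise("2+2", toy_generate, toy_verify, toy_revise)
kept = verify_then_revise("3+3", toy_generate, toy_verify, toy_revise)
```

In the toy run, the first question triggers one revision and the second triggers none, which contrasts with pipelines that always generate a second attempt.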

Title: Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Authors: Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10415
Pdf URL: https://arxiv.org/pdf/2506.10415
Copy Paste: [[2506.10415]] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?(https://arxiv.org/abs/2506.10415)
Keywords: large language model
Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at this https URL.

Title: Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting

Authors: Avneet Kaur, Arnav Arora
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10421
Pdf URL: https://arxiv.org/pdf/2506.10421
Copy Paste: [[2506.10421]] Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting(https://arxiv.org/abs/2506.10421)
Keywords: large language model
Abstract: Framing used by news media, especially in times of conflict, can have a substantial impact on readers' opinions, potentially aggravating the conflict itself. Current studies on the topic of conflict framing offer limited insights because they are qualitative in nature or only look at surface-level generic frames without going deeper. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches, using a combination of frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a higher focus on war-based reporting than on peace-based reporting. We also show substantial differences in reporting across the US, UK, and Middle Eastern news outlets in framing who the assailant and victims of the conflict are, surfacing biases within the media.

Title: SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks

Authors: Kaiyuan Zhang, Siyuan Cheng, Hanxi Guo, Yuetian Chen, Zian Su, Shengwei An, Yuntao Du, Charles Fleming, Ashish Kundu, Xiangyu Zhang, Ninghui Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10424
Pdf URL: https://arxiv.org/pdf/2506.10424
Copy Paste: [[2506.10424]] SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks(https://arxiv.org/abs/2506.10424)
Keywords: privacy, protect, defense, attack, membership infer, large language model
Abstract: Large language models (LLMs) have achieved remarkable success and are widely adopted for diverse applications. However, fine-tuning these models often involves private or sensitive information, raising critical privacy concerns. In this work, we conduct the first comprehensive study evaluating the vulnerability of fine-tuned LLMs to membership inference attacks (MIAs). Our empirical analysis demonstrates that MIAs exploit the loss reduction during fine-tuning, making them highly effective in revealing membership information. These findings motivate the development of our defense. We propose SOFT (Selective data Obfuscation in LLM Fine-Tuning), a novel defense technique that mitigates privacy leakage by leveraging influential data selection with an adjustable parameter to balance utility preservation and privacy protection. Our extensive experiments span six diverse domains and multiple LLM architectures and scales. Results show that SOFT effectively reduces privacy risks while maintaining competitive model performance, offering a practical and scalable solution to safeguard sensitive information in fine-tuned LLMs.

Title: It's Not the Target, It's the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations

Authors: Guoyi Zhang, Guangsheng Xu, Siyang Chen, Han Wang, Xiaohu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10425
Pdf URL: https://arxiv.org/pdf/2506.10425
Copy Paste: [[2506.10425]] It's Not the Target, It's the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations(https://arxiv.org/abs/2506.10425)
Keywords: robust
Abstract: Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression-reconstruction-subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model's resilience to sensor noise. The source code will be made publicly available upon acceptance.

Title: MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment

Authors: Shuo Wang, Jihao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10430
Pdf URL: https://arxiv.org/pdf/2506.10430
Copy Paste: [[2506.10430]] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment(https://arxiv.org/abs/2506.10430)
Keywords: extraction, transformer, segmentation
Abstract: The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
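
Editor's sketch (not from the paper): the key shot selection step mentions Non-Maximum Suppression over predicted segments. A minimal 1D (temporal) NMS might look as follows, assuming segments are (start, end) pairs with one score each. The threshold of 0.5 and the example values are arbitrary, and KTS is not shown.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: visit segments by descending score and keep one only if
    it does not overlap an already kept segment by more than the threshold."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two heavily overlapping candidates and one separate candidate.
segments = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = temporal_nms(segments, scores)
```

Here the second segment overlaps the first with an IoU of 9/11 and is suppressed, so indices 0 and 2 survive.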

Title: System Identification Using Kolmogorov-Arnold Networks: A Case Study on Buck Converters

Authors: Nart Gashi, Panagiotis Kakosimos, George Papafotiou
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.10434
Pdf URL: https://arxiv.org/pdf/2506.10434
Copy Paste: [[2506.10434]] System Identification Using Kolmogorov-Arnold Networks: A Case Study on Buck Converters(https://arxiv.org/abs/2506.10434)
Keywords: interpretability
Abstract: Kolmogorov-Arnold Networks (KANs) are emerging as a powerful framework for interpretable and efficient system identification in dynamic systems. By leveraging the Kolmogorov-Arnold representation theorem, KANs enable function approximation through learnable activation functions, offering improved scalability, accuracy, and interpretability compared to traditional neural networks. This paper investigates the application of KANs to model and analyze the dynamics of a buck converter system, focusing on state-space parameter estimation along with discovering the system equations. Using simulation data, the methodology involves approximating state derivatives with KANs, constructing interpretable state-space representations, and validating these models through numerical experiments. The results demonstrate the ability of KANs to accurately identify system dynamics, verify model consistency, and detect parameter changes, providing valuable insights into their applicability for system identification in modern industrial systems.
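
Editor's sketch (not from the paper): the workflow described above (simulate, approximate state derivatives, recover state-space parameters) can be illustrated on the standard averaged buck converter model, L di/dt = d*Vin - v and C dv/dt = i - v/R. Plain least squares is used here as a stand-in for the KAN, the load resistance R is assumed known, and all component values are hypothetical.

```python
def simulate_buck(L, C, R, v_in, duty, dt, steps):
    """Forward-Euler simulation of the averaged buck converter model:
    L di/dt = duty*v_in - v,  C dv/dt = i - v/R."""
    i, v = 0.0, 0.0
    trajectory = [(i, v)]
    for _ in range(steps):
        di_dt = (duty * v_in - v) / L
        dv_dt = (i - v / R) / C
        i, v = i + dt * di_dt, v + dt * dv_dt
        trajectory.append((i, v))
    return trajectory


def fit_slope(xs, ys):
    """Least-squares slope of y = k * x through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical component values chosen only for illustration.
L_true, C_true, R, v_in, duty, dt = 1e-3, 1e-4, 10.0, 12.0, 0.5, 1e-6
traj = simulate_buck(L_true, C_true, R, v_in, duty, dt, steps=20000)

# Finite-difference state derivatives, then one-parameter regressions.
di = [(b[0] - a[0]) / dt for a, b in zip(traj, traj[1:])]
dv = [(b[1] - a[1]) / dt for a, b in zip(traj, traj[1:])]
L_hat = 1.0 / fit_slope([duty * v_in - a[1] for a in traj[:-1]], di)
C_hat = 1.0 / fit_slope([a[0] - a[1] / R for a in traj[:-1]], dv)
v_final = traj[-1][1]
```

Because the derivatives are computed with the same forward difference used by the simulator, the regressions recover L and C almost exactly; with measured or noisy data the estimates would only be approximate.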

Title: MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices

Authors: Zhaode Wang, Jingbang Yang, Xinyu Qian, Shiwen Xing, Xiaotang Jiang, Chengfei Lv, Shengyu Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10443
Pdf URL: https://arxiv.org/pdf/2506.10443
Copy Paste: [[2506.10443]] MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices(https://arxiv.org/abs/2506.10443)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge device inference presents a promising solution. The primary challenges of edge inference include memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics while employing strategies such as multicore load balancing, mixed-precision floating-point operations, and geometric computations to enhance performance. Notably, MNN-LLM achieves up to an 8.6x speed increase compared to current mainstream LLM-specific frameworks.
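
Editor's sketch (not from the paper): the abstract names model quantization as one way to reduce memory usage but does not specify the scheme. A generic symmetric per-tensor int8 quantizer, shown purely as an illustration of the general technique and not as MNN-LLM's actual method, could look like this. The function names and weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in
    [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale.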

Title: Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

Authors: Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10446
Pdf URL: https://arxiv.org/pdf/2506.10446
Copy Paste: [[2506.10446]] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty(https://arxiv.org/abs/2506.10446)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem's complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model's overall performance. Specifically, we manage the model's reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.
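
Editor's sketch (not from the paper): the abstract does not give the form of the length penalty, so the function below is only one hypothetical shape for a reward that combines correctness with a power-law penalty on output length. The parameters alpha, power, and max_tokens, and the formula itself, are assumptions and should not be read as the authors' reward.

```python
def shaped_reward(is_correct, num_tokens, alpha=0.5, power=0.5, max_tokens=1024):
    """Hypothetical reward: a correctness term minus a length penalty that
    grows as a power of the fraction of the token budget used."""
    accuracy_term = 1.0 if is_correct else 0.0
    length_fraction = min(num_tokens, max_tokens) / max_tokens
    return accuracy_term - alpha * length_fraction ** power


short_correct = shaped_reward(True, 256)
long_correct = shaped_reward(True, 1024)
long_wrong = shaped_reward(False, 1024)
```

Under this made-up shape, a correct short answer scores higher than a correct long one, while any correct answer still scores at least as high as an incorrect one of the same length.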

Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

Authors: Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.10452
Pdf URL: https://arxiv.org/pdf/2506.10452
Copy Paste: [[2506.10452]] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts(https://arxiv.org/abs/2506.10452)
Keywords: robust, transformer
Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at this https URL.

Title: Rethinking Generative Human Video Coding with Implicit Motion Transformation

Authors: Bolin Chen, Ru-Ling Liao, Jie Chen, Yan Ye
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.10453
Pdf URL: https://arxiv.org/pdf/2506.10453
Copy Paste: [[2506.10453]] Rethinking Generative Human Video Coding with Implicit Motion Transformation(https://arxiv.org/abs/2506.10453)
Keywords: generative
Abstract: Beyond traditional hybrid-based video codecs, generative video codecs can achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns: when explicit motion guidance is used for Generative Human Video Coding (GHVC), the reconstruction results can suffer from severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation, namely IMT. In particular, we propose to characterize complex human body signals as compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which enables GHVC to achieve high-efficiency compression and high-fidelity synthesis.

Title: Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance

Authors: Chun Liu, Bingqian Zhu, Tao Xu, Zheng Zheng, Zheng Li, Wei Yang, Zhigang Han, Jiayao Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.10459
Pdf URL: https://arxiv.org/pdf/2506.10459
Copy Paste: [[2506.10459]] Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance(https://arxiv.org/abs/2506.10459)
Keywords: security, defense, attack, robust
Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification technologies based on DNNs. In the domain of natural images, numerous transfer-based adversarial attack methods have been studied. However, HSIs differ from natural images due to their high-dimensional and rich spectral information. Current research on HSI adversarial examples remains limited and faces challenges in fully utilizing the structural and feature information of images. To address these issues, this paper proposes a novel method to enhance the transferability of the adversarial examples for HSI classification models. First, while keeping the image structure unchanged, the proposed method randomly divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied on a block by block basis to increase input diversity and mitigate overfitting. Second, a feature distancing loss targeting intermediate layers is designed, which measures the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, while the output layer prediction serves as the auxiliary loss. This guides the perturbation to disrupt the features of the true class in adversarial examples, effectively enhancing transferability. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve effective transferability to black-box models on two public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.
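
Editor's sketch (not from the paper): the first step of the method randomly divides the image into blocks along the spatial and spectral dimensions and transforms each block separately. The snippet below illustrates that partitioning on a small H x W x B cube stored as nested lists. The per-block transform (a random gain in [0.9, 1.1]) and all names are assumptions, since the abstract does not list the transformations used.

```python
import itertools
import random


def random_intervals(length, num_blocks, rng):
    """Split range(length) into num_blocks contiguous, non-empty intervals
    with randomly chosen cut points."""
    cuts = sorted(rng.sample(range(1, length), num_blocks - 1))
    bounds = [0] + cuts + [length]
    return list(zip(bounds[:-1], bounds[1:]))


def blockwise_transform(cube, blocks_per_axis, rng):
    """Apply an independent random gain to every spatial-spectral block.
    The block layout is random, but pixel positions are left unchanged."""
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    out = [[[cube[h][w][b] for b in range(bands)] for w in range(width)]
           for h in range(height)]
    axes = [random_intervals(n, blocks_per_axis, rng) for n in (height, width, bands)]
    for (h0, h1), (w0, w1), (b0, b1) in itertools.product(*axes):
        gain = rng.uniform(0.9, 1.1)  # illustrative transform only
        for h in range(h0, h1):
            for w in range(w0, w1):
                for b in range(b0, b1):
                    out[h][w][b] *= gain
    return out


rng = random.Random(0)
cube = [[[1.0 for _ in range(6)] for _ in range(4)] for _ in range(4)]
transformed = blockwise_transform(cube, blocks_per_axis=2, rng=rng)
```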

Title: Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization

Authors: Stone Yun, Alexander Wong
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2506.10463
Pdf URL: https://arxiv.org/pdf/2506.10463
Copy Paste: [[2506.10463]] Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization(https://arxiv.org/abs/2506.10463)
Keywords: robust
Abstract: Deep neural network (DNN) quantization for fast, efficient inference has been an important tool in limiting the cost of machine learning (ML) model inference. Quantization-specific model development techniques such as regularization, quantization-aware training, and quantization-robustness penalties have served to greatly boost the accuracy and robustness of modern DNNs. However, very little exploration has been done on improving the initial conditions of DNN training for quantization. Just as random weight initialization has been shown to significantly impact test accuracy of floating point models, it would make sense that different weight initialization methods impact quantization robustness of trained models. We present an extensive study examining the effects of different weight initializations on a variety of CNN building blocks commonly used in efficient CNNs. This analysis reveals that even with varying CNN architectures, the choice of random weight initializer can significantly affect final quantization robustness. Next, we explore a new method for quantization-robust CNN initialization -- using Graph Hypernetworks (GHN) to predict parameters of quantized DNNs. Besides showing that GHN-predicted parameters are quantization-robust after regular float32 pretraining (of the GHN), we find that finetuning GHNs to predict parameters for quantized graphs (which we call GHN-QAT) can further improve quantized accuracy of CNNs. Notably, GHN-QAT shows significant accuracy improvements for even 4-bit quantization and better-than-random accuracy for 2-bits. To the best of our knowledge, this is the first in-depth study on quantization-aware DNN weight initialization. GHN-QAT offers a novel approach to quantized DNN model design. Future investigations, such as using GHN-QAT-initialized parameters for quantization-aware training, can further streamline the DNN quantization process.

Title: MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models

Authors: Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang, Wei Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10465
Pdf URL: https://arxiv.org/pdf/2506.10465
Copy Paste: [[2506.10465]] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models(https://arxiv.org/abs/2506.10465)
Keywords: large language model, segmentation
Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also capable of producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.

Title: Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications

Authors: Felix Härer
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10467
Pdf URL: https://arxiv.org/pdf/2506.10467
Copy Paste: [[2506.10467]] Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications(https://arxiv.org/abs/2506.10467)
Keywords: security
Abstract: Recent advancements in LLMs indicate potential for novel applications, e.g., through reasoning capabilities in the latest OpenAI and DeepSeek models. For applying these models in specific domains beyond text generation, LLM-based multi-agent approaches can be utilized that solve complex tasks by combining reasoning techniques, code generation, and software execution. Applications might utilize these capabilities and the knowledge of specialized LLM agents. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application is not explored well. Defined specifications for multi-agent LLM systems are required to explore their potential and their suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research to specify and evaluate these aspects through a multi-agent system. The system architecture and prototype are extended from previous research and a specification is introduced for multi-agent systems. Test cases involving cybersecurity tasks indicate feasibility of the architecture and evaluation approach. In particular, the results show the evaluation of question answering, server security, and network security tasks that were completed correctly by agents with LLMs from OpenAI and DeepSeek.

Title: LLMs Are Not Yet Ready for Deepfake Image Detection

Authors: Shahroz Tariq, David Nguyen, M.A.P. Chamikara, Tingmin Wu, Alsharif Abuadbba, Kristen Moore
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10474
Pdf URL: https://arxiv.org/pdf/2506.10474
Copy Paste: [[2506.10474]] LLMs Are Not Yet Ready for Deepfake Image Detection(https://arxiv.org/abs/2506.10474)
Keywords: interpretability, large language model
Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model's classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.

Title: Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Authors: Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10486
Pdf URL: https://arxiv.org/pdf/2506.10486
Copy Paste: [[2506.10486]] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers(https://arxiv.org/abs/2506.10486)
Keywords: interpretability
Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
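
Editor's sketch (not from the paper): comparing model-selected cells with human-annotated rationales can be done with set-based precision, recall, and F1 over (row, column) indices. The abstract does not state which metric the authors use, so this scoring function and the example cells are assumptions.

```python
def cell_prf(predicted_cells, gold_cells):
    """Precision, recall, and F1 between predicted and gold table cells,
    where each cell is a (row, column) pair."""
    predicted, gold = set(predicted_cells), set(gold_cells)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


gold = [(1, 2), (1, 3)]               # minimal cells marked by an annotator
predicted = [(1, 2), (2, 2), (3, 1)]  # cells a model claims to rely on
precision, recall, f1 = cell_prf(predicted, gold)
```

A model can predict the correct claim label and still score poorly here, which is the gap between label accuracy and rationale alignment that the abstract describes.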

Title: Class-Incremental Learning for Honey Botanical Origin Classification with Hyperspectral Images: A Study with Continual Backpropagation

Authors: Guyang Zhang, Waleed Abdulla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10489
Pdf URL: https://arxiv.org/pdf/2506.10489
Copy Paste: [[2506.10489]] Class-Incremental Learning for Honey Botanical Origin Classification with Hyperspectral Images: A Study with Continual Backpropagation(https://arxiv.org/abs/2506.10489)
Keywords: protect
Abstract: Honey is an important commodity in the global market. Honey types of different botanical origins provide diversified flavors and health benefits, thus having different market values. Developing accurate and effective botanical origin-distinguishing techniques is crucial to protect consumers' interests. However, it is impractical to collect all the varieties of honey products at once to train a model for botanical origin differentiation. Therefore, researchers developed class-incremental learning (CIL) techniques to address this challenge. This study examined and compared multiple CIL algorithms on a real-world honey hyperspectral imaging dataset. A novel technique is also proposed to improve the performance of class-incremental learning algorithms by combining them with a continual backpropagation (CB) algorithm. The CB method addresses the issue of loss-of-plasticity by reinitializing a proportion of less-used hidden neurons to inject variability into neural networks. Experiments showed that CB improved the performance of most CIL methods by 1-7%.
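
Editor's sketch (not from the paper): the CB step described above reinitializes a proportion of the least-used hidden neurons. A minimal version for one hidden layer is sketched below, assuming a per-neuron utility score is already tracked. The utility values, the initialization range, and the choice to zero the outgoing weights are assumptions made for illustration.

```python
import random


def reinit_least_used(weights_in, weights_out, utility, fraction, rng):
    """Reinitialize the given fraction of hidden neurons with the lowest
    utility. weights_in[j] and weights_out[j] hold the incoming and outgoing
    weights of hidden neuron j. Returns the indices that were reset."""
    count = int(fraction * len(utility))
    reset = sorted(range(len(utility)), key=lambda j: utility[j])[:count]
    for j in reset:
        weights_in[j] = [rng.uniform(-0.1, 0.1) for _ in weights_in[j]]
        weights_out[j] = [0.0 for _ in weights_out[j]]  # new unit starts silent
        utility[j] = 0.0
    return reset


rng = random.Random(0)
weights_in = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2], [0.7, 0.7]]
weights_out = [[1.0], [2.0], [3.0], [4.0]]
utility = [0.9, 0.1, 0.5, 0.05]
reset = reinit_least_used(weights_in, weights_out, utility, fraction=0.5, rng=rng)
```

Zeroing the outgoing weights keeps the network's current outputs unchanged at the moment of the reset, while the fresh incoming weights give the neuron a new chance to become useful.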

Title: Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models

Authors: Aleksandra Sorokovikova, Pavel Chizhov, Iuliia Eremenko, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10491
Pdf URL: https://arxiv.org/pdf/2506.10491
Copy Paste: [[2506.10491]] Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models(https://arxiv.org/abs/2506.10491)
Keywords: fair, large language model
Abstract: Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user's answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.

Title: A Crack in the Bark: Leveraging Public Knowledge to Remove Tree-Ring Watermarks

Authors: Junhua Lin, Marc Juarez (University of Edinburgh)
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10502
Pdf URL: https://arxiv.org/pdf/2506.10502
Copy Paste: [[2506.10502]] A Crack in the Bark: Leveraging Public Knowledge to Remove Tree-Ring Watermarks(https://arxiv.org/abs/2506.10502)
Keywords: attack, robust, watermark, diffusion
Abstract: We present a novel attack specifically designed against Tree-Ring, a watermarking technique for diffusion models known for its high imperceptibility and robustness against removal attacks. Unlike previous removal attacks, which rely on strong assumptions about attacker capabilities, our attack only requires access to the variational autoencoder that was used to train the target diffusion model, a component that is often publicly available. By leveraging this variational autoencoder, the attacker can approximate the model's intermediate latent space, enabling more effective surrogate-based attacks. Our evaluation shows that this approach leads to a dramatic reduction in the AUC of Tree-Ring detector's ROC and PR curves, decreasing from 0.993 to 0.153 and from 0.994 to 0.385, respectively, while maintaining high image quality. Notably, our attacks outperform existing methods that assume full access to the diffusion model. These findings highlight the risk of reusing public autoencoders to train diffusion models -- a threat not considered by current industry practices. Furthermore, the results suggest that the Tree-Ring detector's precision, a metric that has been overlooked by previous evaluations, falls short of the requirements for real-world deployment.

Title: Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation

Authors: Shuyang Li, Shuang Wang, Zhuangzhuang Sun, Jing Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10503
Pdf URL: https://arxiv.org/pdf/2506.10503
Copy Paste: [[2506.10503]] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation(https://arxiv.org/abs/2506.10503)
Keywords: segmentation
Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art methods. Our code will be made publicly available.

Title: Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Authors: Sangmin Song, Juhwan Choi, JungMin Yun, YoungBin Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10504
Pdf URL: https://arxiv.org/pdf/2506.10504
Copy Paste: [[2506.10504]] Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models(https://arxiv.org/abs/2506.10504)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.

Title: J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft

Authors: Jin Huang, Mingqiang Wei, Zikuan Li, Hangyu Qu, Wei Zhao, Xinyu Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10505
Pdf URL: https://arxiv.org/pdf/2506.10505
Copy Paste: [[2506.10505]] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft(https://arxiv.org/abs/2506.10505)
Keywords: extraction
Abstract: Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.

Title: Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10508
Pdf URL: https://arxiv.org/pdf/2506.10508
Copy Paste: [[2506.10508]] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs(https://arxiv.org/abs/2506.10508)
Keywords: large language model
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is equally important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.

Title: CogStream: Context-guided Streaming Video Question Answering

Authors: Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10516
Pdf URL: https://arxiv.org/pdf/2506.10516
Copy Paste: [[2506.10516]] CogStream: Context-guided Streaming Video Question Answering(https://arxiv.org/abs/2506.10516)
Keywords: large language model
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.

Title: ALBERT: Advanced Localization and Bidirectional Encoder Representations from Transformers for Automotive Damage Evaluation

Authors: Teerapong Panboonyuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10524
Pdf URL: https://arxiv.org/pdf/2506.10524
Copy Paste: [[2506.10524]] ALBERT: Advanced Localization and Bidirectional Encoder Representations from Transformers for Automotive Damage Evaluation(https://arxiv.org/abs/2506.10524)
Keywords: transformer, segmentation
Abstract: This paper introduces ALBERT, an instance segmentation model specifically designed for comprehensive car damage and part segmentation. Leveraging the power of Bidirectional Encoder Representations, ALBERT incorporates advanced localization mechanisms to accurately identify and differentiate between real and fake damages, as well as segment individual car parts. The model is trained on a large-scale, richly annotated automotive dataset that categorizes damage into 26 types, identifies 7 fake damage variants, and segments 61 distinct car parts. Our approach demonstrates strong performance in both segmentation accuracy and damage classification, paving the way for intelligent automotive inspection and assessment applications.

Title: SLICK: Selective Localization and Instance Calibration for Knowledge-Enhanced Car Damage Segmentation in Automotive Insurance

Authors: Teerapong Panboonyuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10528
Pdf URL: https://arxiv.org/pdf/2506.10528
Copy Paste: [[2506.10528]] SLICK: Selective Localization and Instance Calibration for Knowledge-Enhanced Car Damage Segmentation in Automotive Insurance(https://arxiv.org/abs/2506.10528)
Keywords: robust, segmentation
Abstract: We present SLICK, a novel framework for precise and robust car damage segmentation that leverages structural priors and domain knowledge to tackle real-world automotive inspection challenges. SLICK introduces five key components: (1) Selective Part Segmentation using a high-resolution semantic backbone guided by structural priors to achieve surgical accuracy in segmenting vehicle parts even under occlusion, deformation, or paint loss; (2) Localization-Aware Attention blocks that dynamically focus on damaged regions, enhancing fine-grained damage detection in cluttered and complex street scenes; (3) an Instance-Sensitive Refinement head that leverages panoptic cues and shape priors to disentangle overlapping or adjacent parts, enabling precise boundary alignment; (4) Cross-Channel Calibration through multi-scale channel attention that amplifies subtle damage signals such as scratches and dents while suppressing noise like reflections and decals; and (5) a Knowledge Fusion Module that integrates synthetic crash data, part geometry, and real-world insurance datasets to improve generalization and handle rare cases effectively. Experiments on large-scale automotive datasets demonstrate SLICK's superior segmentation performance, robustness, and practical applicability for insurance and automotive inspection workflows.

Title: Equivariant Neural Diffusion for Molecule Generation

Authors: François Cornet, Grigory Bartosh, Mikkel N. Schmidt, Christian A. Naesseth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10532
Pdf URL: https://arxiv.org/pdf/2506.10532
Copy Paste: [[2506.10532]] Equivariant Neural Diffusion for Molecule Generation(https://arxiv.org/abs/2506.10532)
Keywords: diffusion, generative
Abstract: We introduce Equivariant Neural Diffusion (END), a novel diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Compared to current state-of-the-art equivariant diffusion models, the key innovation in END lies in its learnable forward process for enhanced generative modelling. Rather than pre-specified, the forward process is parameterized through a time- and data-dependent transformation that is equivariant to rigid transformations. Through a series of experiments on standard molecule generation benchmarks, we demonstrate the competitive performance of END compared to several strong baselines for both unconditional and conditional generation.

Title: Data-driven Day Ahead Market Prices Forecasting: A Focus on Short Training Set Windows

Authors: Vasilis Michalakopoulos, Christoforos Menos-Aikateriniadis, Elissaios Sarmas, Antonis Zakynthinos, Pavlos S. Georgilakis, Dimitris Askounis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10536
Pdf URL: https://arxiv.org/pdf/2506.10536
Copy Paste: [[2506.10536]] Data-driven Day Ahead Market Prices Forecasting: A Focus on Short Training Set Windows(https://arxiv.org/abs/2506.10536)
Keywords: robust
Abstract: This study investigates the performance of machine learning models in forecasting electricity Day-Ahead Market (DAM) prices using short historical training windows, with a focus on detecting seasonal trends and price spikes. We evaluate four models, namely LSTM with Feed Forward Error Correction (FFEC), XGBoost, LightGBM, and CatBoost, across three European energy markets (Greece, Belgium, Ireland) using feature sets derived from ENTSO-E forecast data. Training window lengths range from 7 to 90 days, allowing assessment of model adaptability under constrained data availability. Results indicate that LightGBM consistently achieves the highest forecasting accuracy and robustness, particularly with 45 and 60 day training windows, which balance temporal relevance and learning depth. Furthermore, LightGBM demonstrates superior detection of seasonal effects and peak price events compared to LSTM and other boosting models. These findings suggest that short-window training approaches, combined with boosting methods, can effectively support DAM forecasting in volatile, data-scarce environments.
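The short-window setup described above can be sketched as a rolling split in which each forecast day is trained only on the preceding `window` days. This is a minimal illustration, not the authors' pipeline: the function name `rolling_windows`, the 100-day toy calendar, and the one-day forecast horizon are assumptions; only the window lengths (7 to 90 days) come from the abstract.

```python
from datetime import date, timedelta


def rolling_windows(days, window, horizon=1):
    """Yield (train_days, test_days) pairs.

    Each test block of `horizon` days is preceded by exactly `window`
    training days, so older history is deliberately discarded.
    """
    for end in range(window, len(days) - horizon + 1):
        yield days[end - window:end], days[end:end + horizon]


# Hypothetical calendar of 100 consecutive delivery days.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(100)]

# One list of splits per training-window length mentioned in the abstract.
splits = {w: list(rolling_windows(days, w)) for w in (7, 45, 60, 90)}

for w, s in splits.items():
    print(f"window={w:2d} days -> {len(s)} train/test splits")
```

A model such as LightGBM would be refit on each `train_days` slice and scored on the matching `test_days` slice; shorter windows give more splits but less data per fit.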

Title: From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations

Authors: Yutong Zhou, Masahiro Ryo
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2506.10559
Pdf URL: https://arxiv.org/pdf/2506.10559
Copy Paste: [[2506.10559]] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations(https://arxiv.org/abs/2506.10559)
Keywords: extraction, large language model
Abstract: Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.

Title: Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics

Authors: Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana, Francisco Zamora-Martinez, Javier San Agustin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10564
Pdf URL: https://arxiv.org/pdf/2506.10564
Copy Paste: [[2506.10564]] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics(https://arxiv.org/abs/2506.10564)
Keywords: robust, biometric, fair
Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI's superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.

Title: LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System

Authors: Hongbeen Park, Minjeong Park, Giljoo Nam, Jinkyu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10567
Pdf URL: https://arxiv.org/pdf/2506.10567
Copy Paste: [[2506.10567]] LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System(https://arxiv.org/abs/2506.10567)
Keywords: robust
Abstract: Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in achieving real-time performance, robustness, and scalability for large-scale scenes. Recent approaches utilizing neural implicit scene representations show promise but suffer from high computational costs and memory requirements. ESLAM introduced a plane-based tensor decomposition but still struggled with memory growth. Addressing these challenges, we propose a more efficient visual SLAM model, called LRSLAM, utilizing low-rank tensor decomposition methods. Our approach, leveraging the Six-axis and CP decompositions, achieves better convergence rates, memory efficiency, and reconstruction/localization quality than existing state-of-the-art approaches. Evaluation across diverse indoor RGB-D datasets demonstrates LRSLAM's superior performance in terms of parameter efficiency, processing time, and accuracy, retaining reconstruction and localization quality. Our code will be publicly available upon publication.
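To make the memory argument concrete, the sketch below evaluates one entry of a rank-R CP (canonical polyadic) tensor and compares its parameter count with a dense grid. It shows only generic CP decomposition; the Six-axis decomposition and the way LRSLAM applies either method to signed distance fields are not specified in the abstract and are not modelled here. The function names and the 64x64x64 grid with rank 8 are illustrative assumptions.

```python
def cp_value(factors, index):
    """Entry of a CP tensor: sum over r of the product of factors[d][index[d]][r]."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for d, i in enumerate(index):
            term *= factors[d][i][r]
        total += term
    return total


def cp_param_count(shape, rank):
    """One factor matrix of size (dim x rank) per tensor axis."""
    return rank * sum(shape)


def dense_param_count(shape):
    """Number of entries in the full tensor."""
    n = 1
    for s in shape:
        n *= s
    return n


# Tiny rank-1 example: three factor matrices of shape (2 x 1).
factors = [[[1.0], [2.0]], [[3.0], [4.0]], [[5.0], [6.0]]]
print(cp_value(factors, (1, 0, 1)))  # 2 * 3 * 6

# Hypothetical grid size and rank, chosen only to show the scaling.
shape, rank = (64, 64, 64), 8
print(cp_param_count(shape, rank), "vs", dense_param_count(shape))
```

The CP form grows linearly with each axis length, while the dense grid grows with their product, which is the source of the memory savings that low-rank methods target.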

Title: DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

Authors: Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang, Zerong Zheng, Ming Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10568
Pdf URL: https://arxiv.org/pdf/2506.10568
Copy Paste: [[2506.10568]] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers(https://arxiv.org/abs/2506.10568)
Keywords: diffusion, transformer
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: this https URL.

Title: Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration

Authors: Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10573
Pdf URL: https://arxiv.org/pdf/2506.10573
Copy Paste: [[2506.10573]] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration(https://arxiv.org/abs/2506.10573)
Keywords: robust, segmentation
Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.

Title: DanceChat: Large Language Model-Guided Music-to-Dance Generation

Authors: Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, Shanxin Yuan
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.10574
Pdf URL: https://arxiv.org/pdf/2506.10574
Copy Paste: [[2506.10574]] DanceChat: Large Language Model-Guided Music-to-Dance Generation(https://arxiv.org/abs/2506.10574)
Keywords: extraction, diffusion, large language model
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.

Title: Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

Authors: Chun-Mei Feng, Kai Yu, Xinxing Xu, Salman Khan, Rick Siow Mong Goh, Wangmeng Zuo, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10575
Pdf URL: https://arxiv.org/pdf/2506.10575
Copy Paste: [[2506.10575]] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning(https://arxiv.org/abs/2506.10575)
Keywords: robust
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP allow texts to be leveraged directly as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.

Title: Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres

Authors: Muskan Dosi, Chiranjeev Chiranjeev, Kartik Thakral, Mayank Vatsa, Richa Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10576
Pdf URL: https://arxiv.org/pdf/2506.10576
Copy Paste: [[2506.10576]] Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres(https://arxiv.org/abs/2506.10576)
Keywords: diffusion, generative
Abstract: Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: this https URL

Title: Graph Neural Networks for Automatic Addition of Optimizing Components in Printed Circuit Board Schematics

Authors: Pascal Plettenberg, André Alcalde, Bernhard Sick, Josephine M. Thomas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10577
Pdf URL: https://arxiv.org/pdf/2506.10577
Copy Paste: [[2506.10577]] Graph Neural Networks for Automatic Addition of Optimizing Components in Printed Circuit Board Schematics(https://arxiv.org/abs/2506.10577)
Keywords: robust
Abstract: The design and optimization of Printed Circuit Board (PCB) schematics is crucial for the development of high-quality electronic devices. An important task in this process is to optimize drafts by adding components that improve the robustness and reliability of the circuit, e.g., pull-up resistors or decoupling capacitors. Since there is a shortage of skilled engineers and manual optimizations are very time-consuming, these best practices are often neglected. However, this typically leads to higher costs for troubleshooting in later development stages as well as shortened product life cycles, resulting in an increased amount of electronic waste that is difficult to recycle. Here, we present an approach for automating the addition of new components into PCB schematics by representing them as bipartite graphs and utilizing a node pair prediction model based on Graph Neural Networks (GNNs). We apply our approach to three highly relevant PCB design optimization tasks and compare the performance of several popular GNN architectures on real-world datasets labeled by human experts. We show that GNNs can solve these problems with high accuracy and demonstrate that our approach offers the potential to automate PCB design optimizations in a time- and cost-efficient manner.
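One plausible reading of the bipartite representation is a graph whose two node sets are components and nets, with node pair prediction deciding between which nets a new component should be placed. The sketch below builds such a graph from a toy netlist and enumerates candidate net pairs. The component/net split, the netlist, the name-prefix convention, and both function names are assumptions for illustration; the GNN that would score the pairs is omitted.

```python
from itertools import combinations

# Hypothetical toy netlist: component -> nets its pins connect to.
netlist = {
    "U1": ["VCC", "GND", "SDA"],
    "R1": ["VCC", "SDA"],   # pull-up resistor
    "C1": ["VCC", "GND"],   # decoupling capacitor
}


def build_bipartite(netlist):
    """Return adjacency in both directions between components and nets."""
    comp_to_nets = {comp: set(nets) for comp, nets in netlist.items()}
    net_to_comps = {}
    for comp, nets in comp_to_nets.items():
        for net in nets:
            net_to_comps.setdefault(net, set()).add(comp)
    return comp_to_nets, net_to_comps


def candidate_net_pairs(comp_to_nets, net_to_comps, prefix):
    """Net pairs not yet bridged by a two-pin component named with `prefix`."""
    bridged = {
        frozenset(nets)
        for comp, nets in comp_to_nets.items()
        if comp.startswith(prefix) and len(nets) == 2
    }
    return [
        pair
        for pair in combinations(sorted(net_to_comps), 2)
        if frozenset(pair) not in bridged
    ]


comp_to_nets, net_to_comps = build_bipartite(netlist)
print(candidate_net_pairs(comp_to_nets, net_to_comps, "C"))
```

A learned model would then assign each candidate pair a score indicating whether adding, say, a capacitor there matches expert practice.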

Title: Rethinking Random Masking in Self Distillation on ViT

Authors: Jihyeon Seong, Hyunkyung Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10582
Pdf URL: https://arxiv.org/pdf/2506.10582
Copy Paste: [[2506.10582]] Rethinking Random Masking in Self Distillation on ViT(https://arxiv.org/abs/2506.10582)
Keywords: robust, transformer
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student's global view, while preserving the student's local views and the teacher's global view in their original, unmasked forms. This design leverages DINO's multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
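The asymmetric scheme described above can be sketched as a set of per-view patch masks in which only the student's global view has masked patches. This is an illustration under stated assumptions, not the authors' code: the patch counts (196 global, 36 local), the six local views, the 0.3 mask ratio, and the function names are all hypothetical.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean list with round(mask_ratio * num_patches) True entries at random positions."""
    num_masked = round(mask_ratio * num_patches)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def build_view_masks(num_global, num_local, num_local_views, mask_ratio, seed=0):
    """Mask only the student's global view; leave all other views untouched."""
    rng = random.Random(seed)
    return {
        "student_global": random_patch_mask(num_global, mask_ratio, rng),
        "student_local": [[False] * num_local for _ in range(num_local_views)],
        "teacher_global": [False] * num_global,
    }


masks = build_view_masks(num_global=196, num_local=36, num_local_views=6, mask_ratio=0.3)
print(sum(masks["student_global"]), "of 196 student global patches masked")
print(sum(masks["teacher_global"]), "teacher patches masked")
```

Because the teacher always sees the unmasked global view, its output stays a clean target while the student has to match it from partial input.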

Title: Size-adaptive Hypothesis Testing for Fairness

Authors: Antonio Ferrara, Francesco Cozzi, Alan Perotti, André Panisson, Francesco Bonchi
Subjects: cs.LG, cs.AI, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2506.10586
Pdf URL: https://arxiv.org/pdf/2506.10586
Copy Paste: [[2506.10586]] Size-adaptive Hypothesis Testing for Fairness(https://arxiv.org/abs/2506.10586)
Keywords: fair
Abstract: Determining whether an algorithmic decision-making system discriminates against a specific demographic typically involves comparing a single point estimate of a fairness metric against a predefined threshold. This practice is statistically brittle: it ignores sampling error and treats small demographic subgroups the same as large ones. The problem intensifies in intersectional analyses, where multiple sensitive attributes are considered jointly, giving rise to a larger number of smaller groups. As these groups become more granular, the data representing them becomes too sparse for reliable estimation, and fairness metrics yield excessively wide confidence intervals, precluding meaningful conclusions about potential unfair treatments. In this paper, we introduce a unified, size-adaptive, hypothesis-testing framework that turns fairness assessment into an evidence-based statistical decision. Our contribution is twofold. (i) For sufficiently large subgroups, we prove a Central-Limit result for the statistical parity difference, leading to analytic confidence intervals and a Wald test whose type-I (false positive) error is guaranteed at level $\alpha$. (ii) For the long tail of small intersectional groups, we derive a fully Bayesian Dirichlet-multinomial estimator; Monte-Carlo credible intervals are calibrated for any sample size and naturally converge to Wald intervals as more data becomes available. We validate our approach empirically on benchmark datasets, demonstrating how our tests provide interpretable, statistically rigorous decisions under varying degrees of data availability and intersectionality.
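The two regimes can be illustrated with a small sketch: a Wald interval and two-sided test for the statistical parity difference (SPD) between two groups, and a Monte Carlo credible interval for small groups. Two simplifications are assumptions of this sketch, not the paper's method: the null hypothesis is SPD = 0 rather than a tolerance threshold, and the Bayesian part swaps in independent Beta posteriors with a uniform prior for the Dirichlet-multinomial estimator named in the abstract. Function names and counts are hypothetical.

```python
import random
from statistics import NormalDist


def wald_spd(pos_a, n_a, pos_b, n_b, alpha=0.05):
    """SPD = P(y=1 | A) - P(y=1 | B) with a Wald interval and two-sided p-value."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    spd = p_a - p_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (spd - z_crit * se, spd + z_crit * se)
    if se == 0:
        # Degenerate rates (all 0 or all 1): the normal approximation breaks down.
        p_value = 1.0 if spd == 0 else 0.0
    else:
        p_value = 2 * (1 - NormalDist().cdf(abs(spd) / se))
    return spd, ci, p_value


def beta_spd_interval(pos_a, n_a, pos_b, n_b, alpha=0.05, draws=20000, seed=0):
    """Monte Carlo credible interval for SPD under Beta(1, 1) priors (stand-in model)."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + pos_a, 1 + n_a - pos_a)
        - rng.betavariate(1 + pos_b, 1 + n_b - pos_b)
        for _ in range(draws)
    )
    lo = samples[int(alpha / 2 * draws)]
    hi = samples[int((1 - alpha / 2) * draws) - 1]
    return lo, hi


# Large groups: the Wald machinery applies.
spd, ci, p_value = wald_spd(60, 100, 40, 100)
print(f"SPD={spd:.2f}, CI=({ci[0]:.3f}, {ci[1]:.3f}), p={p_value:.4f}")

# Small intersectional groups: fall back to the Bayesian interval.
lo, hi = beta_spd_interval(3, 5, 2, 5)
print(f"small-group credible interval: ({lo:.3f}, {hi:.3f})")
```

With only five samples per group the credible interval is wide and covers zero, which is the kind of explicit uncertainty a single point estimate against a threshold hides.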

Title: SoK: Evaluating Jailbreak Guardrails for Large Language Models

Authors: Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10597
Pdf URL: https://arxiv.org/pdf/2506.10597
Copy Paste: [[2506.10597]] SoK: Evaluating Jailbreak Guardrails for Large Language Models(https://arxiv.org/abs/2506.10597)
Keywords: security, defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails--external defense mechanisms that monitor and control LLM interaction--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at this https URL.

Title: Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

Authors: Xinyuan Liu, Hang Xu, Yike Ma, Yucheng Zhang, Feng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10601
Pdf URL: https://arxiv.org/pdf/2506.10601
Copy Paste: [[2506.10601]] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection(https://arxiv.org/abs/2506.10601)
Keywords: extraction
Abstract: Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, which nevertheless remains hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming the SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at this https URL.

Title: High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model

Authors: Eshan Ramesh, Nishio Takayuki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10605
Pdf URL: https://arxiv.org/pdf/2506.10605
Copy Paste: [[2506.10605]] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model(https://arxiv.org/abs/2506.10605)
Keywords: diffusion
Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.

Title: TexTailor: Customized Text-aligned Texturing via Effective Resampling

Authors: Suin Lee, Dae-Shik Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10612
Pdf URL: https://arxiv.org/pdf/2506.10612
Copy Paste: [[2506.10612]] TexTailor: Customized Text-aligned Texturing via Effective Resampling(https://arxiv.org/abs/2506.10612)
Keywords: diffusion
Abstract: We present TexTailor, a novel method for generating consistent object textures from textual descriptions. Existing text-to-texture synthesis approaches utilize depth-aware diffusion models to progressively generate images and synthesize textures across predefined multiple viewpoints. However, these approaches lead to a gradual shift in texture properties across viewpoints due to (1) insufficient integration of previously synthesized textures at each viewpoint during the diffusion process and (2) the autoregressive nature of the texture synthesis process. Moreover, the predefined selection of camera positions, which does not account for the object's geometry, limits the effective use of texture information synthesized from different viewpoints, ultimately degrading overall texture consistency. In TexTailor, we address these issues by (1) applying a resampling scheme that repeatedly integrates information from previously synthesized textures within the diffusion process, and (2) fine-tuning a depth-aware diffusion model on these resampled textures. During this process, we observed that using only a few training images restricts the model's original ability to generate high-fidelity images aligned with the conditioning, and therefore propose a performance preservation loss to mitigate this issue. Additionally, we improve the synthesis of view-consistent textures by adaptively adjusting camera positions based on the object's geometry. Experiments on a subset of the Objaverse dataset and the ShapeNet car dataset demonstrate that TexTailor outperforms state-of-the-art methods in synthesizing view-consistent textures. The source code for TexTailor is available at this https URL.

Title: Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code

Authors: Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10617
Pdf URL: https://arxiv.org/pdf/2506.10617
Copy Paste: [[2506.10617]] Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code(https://arxiv.org/abs/2506.10617)
Keywords: robust, segmentation
Abstract: This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps, a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline's 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: this https URL.
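
The three metrics quoted in this abstract (IoU, MSE, and Pearson's rho) have standard definitions, sketched below in plain Python. This is only an illustration of those definitions on toy inputs, not the authors' evaluation code; the function names and the flat-list data layout are assumptions made for the example.

```python
import math

def mse(reference, estimate):
    """Mean squared error between two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def pearson(reference, estimate):
    """Pearson correlation coefficient (rho) between two signals."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov / math.sqrt(var_r * var_e)

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return inter / union if union else 1.0
```

Lower MSE and higher rho both indicate a digitized trace closer to the reference signal, which is how the comparison against the baseline in the abstract should be read.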

Title: Assessing the Resilience of Automotive Intrusion Detection Systems to Adversarial Manipulation

Authors: Stefano Longari, Paolo Cerracchio, Michele Carminati, Stefano Zanero
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10620
Pdf URL: https://arxiv.org/pdf/2506.10620
Copy Paste: [[2506.10620]] Assessing the Resilience of Automotive Intrusion Detection Systems to Adversarial Manipulation(https://arxiv.org/abs/2506.10620)
Keywords: security, attack, robust
Abstract: The security of modern vehicles has become increasingly important, with the controller area network (CAN) bus serving as a critical communication backbone for various Electronic Control Units (ECUs). The absence of robust security measures in CAN, coupled with the increasing connectivity of vehicles, makes them susceptible to cyberattacks. While intrusion detection systems (IDSs) have been developed to counter such threats, they are not foolproof. Adversarial attacks, particularly evasion attacks, can manipulate inputs to bypass detection by IDSs. This paper extends our previous work by investigating the feasibility and impact of gradient-based adversarial attacks performed with different degrees of knowledge against automotive IDSs. We consider three scenarios: white-box (attacker with full system knowledge), grey-box (partial system knowledge), and the more realistic black-box (no knowledge of the IDS' internal workings or data). We evaluate the effectiveness of the proposed attacks against state-of-the-art IDSs on two publicly available datasets. Additionally, we study the effect of the adversarial perturbation on the attack impact and evaluate real-time feasibility by precomputing evasive payloads for timed injection based on bus traffic. Our results demonstrate that, besides attacks being challenging due to the automotive domain constraints, their effectiveness is strongly dependent on the dataset quality, the target IDS, and the attacker's degree of knowledge.

Title: SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

Authors: Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10622
Pdf URL: https://arxiv.org/pdf/2506.10622
Copy Paste: [[2506.10622]] SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis(https://arxiv.org/abs/2506.10622)
Keywords: large language model
Abstract: The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.

Title: NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors

Authors: Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10627
Pdf URL: https://arxiv.org/pdf/2506.10627
Copy Paste: [[2506.10627]] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors(https://arxiv.org/abs/2506.10627)
Keywords: transformer, large language model
Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at this https URL.

Title: Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

Authors: Yucong Luo, Yitong Zhou, Mingyue Cheng, Jiahao Wang, Daoyu Wang, Tingyue Pan, Jintao Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10630
Pdf URL: https://arxiv.org/pdf/2506.10630
Copy Paste: [[2506.10630]] Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs(https://arxiv.org/abs/2506.10630)
Keywords: privacy
Abstract: To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical techniques to data-driven deep learning architectures. Despite their effectiveness, most existing methods still adhere to a fast-thinking paradigm, relying on extracting historical patterns and mapping them to future values as their core modeling philosophy, and lack an explicit thinking process that incorporates intermediate time series reasoning. Meanwhile, emerging slow-thinking LLMs (e.g., OpenAI-o1) have shown remarkable multi-step reasoning capabilities, offering an alternative way to overcome these issues. However, prompt engineering alone presents several limitations, including high computational cost, privacy risks, and limited capacity for in-depth domain-specific time series reasoning. To address these limitations, a more promising approach is to train LLMs to develop slow-thinking capabilities and acquire strong time series reasoning skills. For this purpose, we propose Time-R1, a two-stage reinforcement fine-tuning framework designed to enhance the multi-step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine-tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model's generalization ability. In particular, we design a fine-grained multi-objective reward specifically for time series forecasting, and then introduce GRIP (group-based relative importance for policy optimization), which leverages non-uniform sampling to further encourage and optimize the model's exploration of effective reasoning paths. Experiments demonstrate that Time-R1 significantly improves forecast performance across diverse datasets.

Title: Hessian Geometry of Latent Space in Generative Models

Authors: Alexander Lobashev, Dmitry Guskov, Maria Larchenko, Mikhail Tamm
Subjects: cs.LG, cond-mat.stat-mech, cs.CV, math.DG, math.ST
Abstract URL: https://arxiv.org/abs/2506.10632
Pdf URL: https://arxiv.org/pdf/2506.10632
Copy Paste: [[2506.10632]] Hessian Geometry of Latent Space in Generative Models(https://arxiv.org/abs/2506.10632)
Keywords: diffusion, generative
Abstract: This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at this https URL.
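
For readers unfamiliar with the link between the log-partition function and the Fisher metric mentioned in this abstract, the standard exponential-family identity is sketched below. The notation (natural parameters \theta, sufficient statistics T, base measure h, log-partition function A) is the textbook convention and is assumed here; the abstract itself does not fix symbols.

```latex
% Exponential family with natural parameters \theta and sufficient statistics T(x):
p(x \mid \theta) = h(x)\, \exp\!\big( \langle \theta, T(x) \rangle - A(\theta) \big),
\qquad
A(\theta) = \log \int h(x)\, \exp\!\big( \langle \theta, T(x) \rangle \big)\, dx .

% The Fisher information metric is the Hessian of the log-partition function,
% which equals the covariance of the sufficient statistics:
g_{ij}(\theta)
  = \mathbb{E}_{\theta}\!\left[ \partial_i \log p(x \mid \theta)\, \partial_j \log p(x \mid \theta) \right]
  = \frac{\partial^2 A(\theta)}{\partial \theta_i\, \partial \theta_j}
  = \operatorname{Cov}_{\theta}\!\big( T_i(x),\, T_j(x) \big).
```

This is why learning A(\theta) suffices to recover the metric: abrupt changes in its second derivatives show up directly as abrupt changes in g.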

Title: Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models

Authors: Konstantinos Vilouras, Ilias Stogiannidis, Junyu Yan, Alison Q. O'Neil, Sotirios A. Tsaftaris
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10633
Pdf URL: https://arxiv.org/pdf/2506.10633
Copy Paste: [[2506.10633]] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models(https://arxiv.org/abs/2506.10633)
Keywords: privacy, robust, diffusion
Abstract: Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.

Title: Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

Authors: Francisco Caetano, Christiaan Viviers, Peter H.N. De With, Fons van der Sommen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10634
Pdf URL: https://arxiv.org/pdf/2506.10634
Copy Paste: [[2506.10634]] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models(https://arxiv.org/abs/2506.10634)
Keywords: generative, segmentation
Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.

Title: CyFence: Securing Cyber-Physical Controllers via Trusted Execution Environment

Authors: Stefano Longari, Alessandro Pozone, Jessica Leoni, Mario Polino, Michele Carminati, Mara Tanelli, Stefano Zanero
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10638
Pdf URL: https://arxiv.org/pdf/2506.10638
Copy Paste: [[2506.10638]] CyFence: Securing Cyber-Physical Controllers via Trusted Execution Environment(https://arxiv.org/abs/2506.10638)
Keywords: security, defense, attack
Abstract: In the last decades, Cyber-physical Systems (CPSs) have experienced a significant technological evolution and increased connectivity, at the cost of greater exposure to cyber-attacks. Since many CPSs are used in safety-critical systems, such attacks entail high risks and potential safety harms. Although several defense strategies have been proposed, they rarely exploit the cyber-physical nature of the system. In this work, we exploit the nature of CPSs by proposing CyFence, a novel architecture that improves the resilience of closed-loop control systems against cyber-attacks by adding a semantic check, used to confirm that the system is behaving as expected. To ensure the security of the semantic check code, we use the Trusted Execution Environment implemented by modern processors. We evaluate CyFence considering a real-world application, consisting of an active braking digital controller, demonstrating that it can mitigate different types of attacks with a negligible computation overhead.

Title: GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

Authors: Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen, YuKun Zhou, Jiancheng Lv, Xingang Wang, Guan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10639
Pdf URL: https://arxiv.org/pdf/2506.10639
Copy Paste: [[2506.10639]] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning(https://arxiv.org/abs/2506.10639)
Keywords: diffusion
Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.

Title: Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters

Authors: Tatsuya Hiraoka, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10641
Pdf URL: https://arxiv.org/pdf/2506.10641
Copy Paste: [[2506.10641]] Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters(https://arxiv.org/abs/2506.10641)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

Title: From IOCs to Group Profiles: On the Specificity of Threat Group Behaviors in CTI Knowledge Bases

Authors: Aakanksha Saha, Martina Lindorfer, Juan Caballero
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10645
Pdf URL: https://arxiv.org/pdf/2506.10645
Copy Paste: [[2506.10645]] From IOCs to Group Profiles: On the Specificity of Threat Group Behaviors in CTI Knowledge Bases(https://arxiv.org/abs/2506.10645)
Keywords: security
Abstract: Indicators of Compromise (IOCs) such as IP addresses, file hashes, and domain names are commonly used for threat detection and attribution. However, IOCs tend to be short-lived as they are easy to change. As a result, the cybersecurity community is shifting focus towards more persistent behavioral profiles, such as the Tactics, Techniques, and Procedures (TTPs) and the software used by a threat group. However, the distinctiveness and completeness of such behavioral profiles remain largely unexplored. In this work, we systematically analyze threat group profiles built from two open cyber threat intelligence (CTI) knowledge bases: MITRE ATT&CK and Malpedia. We first investigate what fraction of threat groups have group-specific behaviors, i.e., behaviors used exclusively by a single group. We find that only 34% of threat groups in ATT&CK have group-specific techniques. The software used by a threat group proves to be more distinctive, with 73% of ATT&CK groups using group-specific software. However, this percentage drops to 24% in the broader Malpedia dataset. Next, we evaluate how group profiles improve when data from both sources are combined. While coverage improves modestly, the proportion of groups with group-specific behaviors remains under 30%. We then enhance profiles by adding exploited vulnerabilities and additional techniques extracted from more threat reports. Despite the additional information, 64% of groups still lack any group-specific behavior. Our findings raise concerns on the belief that behavioral profiles can replace IOCs in threat group attribution.
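
The notion of a group-specific behavior used in this abstract (a behavior used exclusively by a single group) reduces to a simple set computation. The sketch below shows that computation on a toy profile table; the function names, group names, and behavior labels are hypothetical placeholders, not real ATT&CK or Malpedia identifiers, and this is not the authors' analysis code.

```python
from collections import defaultdict

def group_specific(profiles):
    """Map each group to the behaviors that no other group uses.

    `profiles` maps a group name to the set of behaviors (techniques,
    software, ...) attributed to it.
    """
    users = defaultdict(set)  # behavior -> groups that use it
    for group, behaviors in profiles.items():
        for behavior in behaviors:
            users[behavior].add(group)
    return {
        group: {b for b in behaviors if len(users[b]) == 1}
        for group, behaviors in profiles.items()
    }

def fraction_with_specific(profiles):
    """Fraction of groups that have at least one group-specific behavior."""
    specific = group_specific(profiles)
    return sum(1 for s in specific.values() if s) / len(profiles)

# Toy example with placeholder names.
toy_profiles = {
    "group_1": {"behavior_a", "behavior_b"},
    "group_2": {"behavior_b", "behavior_c"},
    "group_3": {"behavior_b"},
}
```

In the toy table, `group_3` uses only a behavior shared with the others, so it has no distinguishing behavior; this is the situation the abstract reports for most groups.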

Title: Data Shifts Hurt CoT: A Theoretical Study

Authors: Lang Yin, Debangshu Banerjee, Gagandeep Singh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10647
Pdf URL: https://arxiv.org/pdf/2506.10647
Copy Paste: [[2506.10647]] Data Shifts Hurt CoT: A Theoretical Study(https://arxiv.org/abs/2506.10647)
Keywords: transformer, large language model
Abstract: Chain of Thought (CoT) has been applied to various large language models (LLMs) and proven to be effective in improving the quality of outputs. In recent studies, transformers are proven to have absolute upper bounds in terms of expressive power, and consequently, they cannot solve many computationally difficult problems. However, empowered by CoT, transformers are proven to be able to solve some difficult problems effectively, such as the $k$-parity problem. Nevertheless, those works rely on two imperative assumptions: (1) identical training and testing distribution, and (2) corruption-free training data with correct reasoning steps. However, in the real world, these assumptions do not always hold. Although the risks of data shifts have caught attention, our work is, to the best of our knowledge, the first to rigorously study the exact harm caused by such shifts. Focusing on the $k$-parity problem, in this work we investigate the joint impact of two types of data shifts, distribution shifts and data poisoning, on the quality of trained models obtained by a well-established CoT decomposition. In addition to revealing a surprising phenomenon that CoT leads to worse performance on learning parity than directly generating the prediction, our technical results also give a rigorous and comprehensive explanation of the mechanistic reasons for such impact.
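
As a reminder of what the $k$-parity problem asks, the sketch below computes the label (the XOR of $k$ secret input bits) both directly and through a chain of intermediate partial parities. The sequential chain is an assumption chosen for simplicity; the abstract does not specify the decomposition the paper analyzes, which may differ (for example, a tree of pairwise XORs). Function names are illustrative.

```python
def parity_label(bits, secret_indices):
    """Direct prediction target: XOR of the bits at the k secret positions."""
    label = 0
    for i in secret_indices:
        label ^= bits[i]
    return label

def parity_chain(bits, secret_indices):
    """Chain-of-thought style targets: the running parity after each secret bit.

    The last element equals parity_label(bits, secret_indices).
    """
    steps = []
    running = 0
    for i in secret_indices:
        running ^= bits[i]
        steps.append(running)
    return steps

# Example: n = 5 input bits, k = 3 secret positions.
example_bits = [1, 0, 1, 1, 0]
example_secret = [0, 2, 3]
```

With intermediate steps as supervision, a corrupted or shifted step propagates into every later step of the chain, which gives some intuition for why data shifts can hurt the decomposed setting.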

Title: GOLIATH: A Decentralized Framework for Data Collection in Intelligent Transportation Systems

Authors: Davide Maffiola, Stefano Longari, Michele Carminati, Mara Tanelli, Stefano Zanero
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10665
Pdf URL: https://arxiv.org/pdf/2506.10665
Copy Paste: [[2506.10665]] GOLIATH: A Decentralized Framework for Data Collection in Intelligent Transportation Systems(https://arxiv.org/abs/2506.10665)
Keywords: robust
Abstract: Intelligent Transportation Systems (ITSs) technology has advanced during the past years, and it is now used for several applications that require vehicles to exchange real-time data, such as in traffic information management. Traditionally, road traffic information has been collected using on-site sensors. However, crowd-sourcing traffic information from onboard sensors or smartphones has become a viable alternative. State-of-the-art solutions currently follow a centralized model where only the service provider has complete access to the collected traffic data and represents a single point of failure and trust. In this paper, we propose GOLIATH, a blockchain-based decentralized framework that runs on the In-Vehicle Infotainment (IVI) system to collect real-time information exchanged between the network's participants. Our approach mitigates the limitations of existing crowd-sourcing centralized solutions by guaranteeing trusted information collection and exchange, fully exploiting the intrinsic distributed nature of vehicles. We demonstrate its feasibility in the context of vehicle positioning and traffic information management. Each vehicle participating in the decentralized network shares its position and those of its neighbors in the form of a transaction recorded on the ledger, which uses a novel consensus mechanism to validate it. We design the consensus mechanism to be resilient against a realistic set of adversaries that aim to tamper with or disable the communication. We evaluate the proposed framework in a simulated (but realistic) environment, which considers different threats and allows us to show its robustness and safety properties.

Title: PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis

Authors: Marzieh Oghbaie, Teresa Araújoa, Hrvoje Bogunović
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10669
Pdf URL: https://arxiv.org/pdf/2506.10669
Copy Paste: [[2506.10669]] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis(https://arxiv.org/abs/2506.10669)
Keywords: robust, interpretability, transformer
Abstract: Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: this https URL

Title: Saturation Self-Organizing Map

Authors: Igor Urbanik, Paweł Gajewski
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10680
Pdf URL: https://arxiv.org/pdf/2506.10680
Copy Paste: [[2506.10680]] Saturation Self-Organizing Map(https://arxiv.org/abs/2506.10680)
Keywords: interpretability
Abstract: Continual learning poses a fundamental challenge for neural systems, which often suffer from catastrophic forgetting when exposed to sequential tasks. Self-Organizing Maps (SOMs), despite their interpretability and efficiency, are not immune to this issue. In this paper, we introduce Saturation Self-Organizing Maps (SatSOM), an extension of SOMs designed to improve knowledge retention in continual learning scenarios. SatSOM incorporates a novel saturation mechanism that gradually reduces the learning rate and neighborhood radius of neurons as they accumulate information. This effectively freezes well-trained neurons and redirects learning to underutilized areas of the map.

Title: Enhancing Deepfake Detection using SE Block Attention with CNN

Authors: Subhram Dasgupta, Janelle Mason, Xiaohong Yuan, Olusola Odeyomi, Kaushik Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10683
Pdf URL: https://arxiv.org/pdf/2506.10683
Copy Paste: [[2506.10683]] Enhancing Deepfake Detection using SE Block Attention with CNN(https://arxiv.org/abs/2506.10683)
Keywords: security
Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative deepfake detection model. However, most of the models designed for deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze-and-excitation (SE) block attention for Deepfake detection. The SE block module is designed to perform dynamic channel-wise feature recalibration. The SE block allows the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform Deepfake detection. The model is smaller in size and achieves accuracy competitive with existing models for deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and an AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the Deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
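
Channel-wise recalibration in a squeeze-and-excitation block can be sketched in a few lines of plain Python. The sketch assumes the standard SE formulation (global average pooling, a two-layer bottleneck with ReLU and sigmoid, then per-channel scaling); the function name `se_block`, the omission of bias terms, and the example weights are illustrative choices, not details from the paper.

```python
import math

def se_block(feature_map, w1, w2):
    """Apply a bias-free SE block.

    feature_map: list of C channels, each a list of rows of floats.
    w1: reduction weights, shape (C/r) x C.
    w2: expansion weights, shape C x (C/r).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates

fmap = [[[1.0, 1.0]], [[2.0, 2.0]]]      # 2 channels, 1 x 2 each
scaled, gates = se_block(fmap, w1=[[1.0, 1.0]], w2=[[0.0], [0.0]])
```

With all-zero expansion weights every gate is sigmoid(0) = 0.5, so each channel is simply halved; trained weights would instead amplify informative channels and damp the rest.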

Title: Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework

Authors: Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun, Zhe Chen, Wei Ni, Jun Luo
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2506.10685
Pdf URL: https://arxiv.org/pdf/2506.10685
Copy Paste: [[2506.10685]] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework(https://arxiv.org/abs/2506.10685)
Keywords: attack, diffusion, large language model
Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.

Title: Large Language Models for Detection of Life-Threatening Texts

Authors: Thanh Thi Nguyen, Campbell Wilson, Janis Dalins
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10687
Pdf URL: https://arxiv.org/pdf/2506.10687
Copy Paste: [[2506.10687]] Large Language Models for Detection of Life-Threatening Texts(https://arxiv.org/abs/2506.10687)
Keywords: transformer, large language model
Abstract: Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs including Gemma, Mistral, and Llama-2 using their 7B parameter variants on different datasets, which are constructed with class balance, imbalance, and extreme imbalance scenarios. Experimental results demonstrate a strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates a great potential of LLMs for real-world life-threatening language detection problems.
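
The upsampling step mentioned above can be illustrated with simple random oversampling, which duplicates minority-class examples until every class matches the largest one. This is a sketch under assumptions: the abstract does not say which upsampling variant the authors used, and the helper name `upsample` and its `seed` parameter are hypothetical.

```python
import random
from collections import Counter

def upsample(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, n in counts.items():
        # Draw the missing examples with replacement from this class only.
        pool = [t for t, l in zip(texts, labels) if l == label]
        extra = rng.choices(pool, k=target - n)
        out_texts.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_texts, out_labels

texts, labels = upsample(["a", "b", "c", "d"], [0, 0, 0, 1])
```

Oversampling should be applied to the training split only, so that duplicated examples never leak into evaluation data.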

Title: Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

Authors: Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10689
Pdf URL: https://arxiv.org/pdf/2506.10689
Copy Paste: [[2506.10689]] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery(https://arxiv.org/abs/2506.10689)
Keywords: robust
Abstract: Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an $\alpha$-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test "ASWIFT-20k" of 20k images, stressing extreme pose ($>$45°), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" lowers the root-mean-square error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from an F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
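
The F2 scores reported above are the beta = 2 case of the general F-beta measure, which weights recall more heavily than precision; that fits a screening task where missing a minor costs more than a false alarm. The sketch below uses the standard definition; the function name `f_beta` and the example counts are illustrative and do not come from the paper.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta > 1 favors recall over precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: a high-recall detector vs. a high-precision one.
high_recall = f_beta(tp=9, fp=9, fn=1)      # precision 0.5, recall 0.9
high_precision = f_beta(tp=5, fp=1, fn=5)   # precision ~0.83, recall 0.5
```

Under F2 the high-recall detector scores higher even though its precision is much lower, which is the behavior the metric is chosen for.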

Title: Preserving Task-Relevant Information Under Linear Concept Removal

Authors: Floris Holstege, Shauli Ravfogel, Bram Wouters
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10703
Pdf URL: https://arxiv.org/pdf/2506.10703
Copy Paste: [[2506.10703]] Preserving Task-Relevant Information Under Linear Concept Removal(https://arxiv.org/abs/2506.10703)
Keywords: protect, fair, interpretability
Abstract: Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLICE (Simultaneous Projection for LInear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLICE achieves this via an oblique projection that "splices out" the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLICE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.
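
A generic rank-one oblique projection shows what "oblique" means here. This is not the SPLICE construction itself, whose choice of directions is derived in the paper; it is only a sketch of the operator P = I - u v^T / (v^T u), which maps the direction u to zero and makes every output orthogonal to v. The vectors `u`, `v` and the helper names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def oblique_project(x, u, v):
    """Apply P = I - u v^T / (v^T u) to x (requires v^T u != 0)."""
    coef = dot(v, x) / dot(v, u)
    # Subtract the multiple of u that cancels x's component along v.
    return [xi - coef * ui for xi, ui in zip(x, u)]

u = [1.0, 1.0]   # direction that gets removed (null space of P)
v = [1.0, 0.0]   # direction along which the output has no component
y = oblique_project([3.0, 5.0], u, v)
```

When u equals v this reduces to the usual orthogonal projection; choosing u different from v is what leaves freedom to preserve other quantities, such as covariance with a target label.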

Title: ConTextTab: A Semantics-Aware Tabular In-Context Learner

Authors: Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10707
Pdf URL: https://arxiv.org/pdf/2506.10707
Copy Paste: [[2506.10707]] ConTextTab: A Semantics-Aware Tabular In-Context Learner(https://arxiv.org/abs/2506.10707)
Keywords: large language model
Abstract: Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. While being architecturally efficient and well-adapted to tabular data structures, current table-native ICL architectures, being trained exclusively on synthetic data, do not fully leverage the rich semantics and world knowledge contained in real-world tabular data. On another end of this spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark.

Title: Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

Authors: Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10712
Pdf URL: https://arxiv.org/pdf/2506.10712
Copy Paste: [[2506.10712]] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement(https://arxiv.org/abs/2506.10712)
Keywords: diffusion, generative, segmentation
Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.

Title: Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet

Authors: Lorenzo Augello, John P. McCrae
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10715
Pdf URL: https://arxiv.org/pdf/2506.10715
Copy Paste: [[2506.10715]] Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet(https://arxiv.org/abs/2506.10715)
Keywords: large language model
Abstract: Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.

Title: Commitment Schemes for Multi-Party Computation

Authors: Ioan Ionescu, Ruxandra F. Olimid
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10721
Pdf URL: https://arxiv.org/pdf/2506.10721
Copy Paste: [[2506.10721]] Commitment Schemes for Multi-Party Computation(https://arxiv.org/abs/2506.10721)
Keywords: security, privacy, robust
Abstract: The paper presents an analysis of Commitment Schemes (CSs) used in Multi-Party Computation (MPC) protocols. While the individual properties of CSs and the guarantees offered by MPC have been widely studied in isolation, their interrelation in concrete protocols and applications remains mostly underexplored. This paper presents the relation between the two, with an emphasis on (security) properties and their impact on the upper layer MPC. In particular, we investigate how different types of CSs contribute to various MPC constructions and their relation to real-life applications of MPC. The paper can also serve as a tutorial for understanding the cryptographic interplay between CS and MPC, making it accessible to both researchers and practitioners. Our findings emphasize the importance of carefully selecting CS to meet the adversarial and functional requirements of MPC, thereby aiming for more robust and privacy-preserving cryptographic applications.

Title: TED-LaST: Towards Robust Backdoor Defense Against Adaptive Attacks

Authors: Xiaoxing Mo, Yuxuan Cheng, Nan Sun, Leo Yu Zhang, Wei Luo, Shang Gao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10722
Pdf URL: https://arxiv.org/pdf/2506.10722
Copy Paste: [[2506.10722]] TED-LaST: Towards Robust Backdoor Defense Against Adaptive Attacks(https://arxiv.org/abs/2506.10722)
Keywords: security, defense, attack, robust, steal
Abstract: Deep Neural Networks (DNNs) are vulnerable to backdoor attacks, where attackers implant hidden triggers during training to maliciously control model behavior. Topological Evolution Dynamics (TED) has recently emerged as a powerful tool for detecting backdoor attacks in DNNs. However, TED can be vulnerable to backdoor attacks that adaptively distort topological representation distributions across network layers. To address this limitation, we propose TED-LaST (Topological Evolution Dynamics against Laundry, Slow release, and Target mapping attack strategies), a novel defense strategy that enhances TED's robustness against adaptive attacks. TED-LaST introduces two key innovations: label-supervised dynamics tracking and adaptive layer emphasis. These enhancements enable the identification of stealthy threats that evade traditional TED-based defenses, even in cases of inseparability in topological space and subtle topological perturbations. We review and classify data poisoning tricks in state-of-the-art adaptive attacks and propose enhanced adaptive attack with target mapping, which can dynamically shift malicious tasks and fully leverage the stealthiness that adaptive attacks possess. Our comprehensive experiments on multiple datasets (CIFAR-10, GTSRB, and ImageNet100) and model architectures (ResNet20, ResNet101) show that TED-LaST effectively counteracts sophisticated backdoors like Adap-Blend, Adapt-Patch, and the proposed enhanced adaptive attack. TED-LaST sets a new benchmark for robust backdoor detection, substantially enhancing DNN security against evolving threats.

Title: Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims

Authors: Priyanka Kargupta, Runchu Tian, Jiawei Han
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.10728
Pdf URL: https://arxiv.org/pdf/2506.10728
Copy Paste: [[2506.10728]] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims(https://arxiv.org/abs/2506.10728)
Keywords: robust
Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely "true" or "false" -- as is frequently the case with scientific and political claims. However, a claim (e.g., "vaccine A is better than vaccine B") can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., "how many biomedical papers believe vaccine A is more transportable than B?"). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.

Title: TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora

Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.10737
Pdf URL: https://arxiv.org/pdf/2506.10737
Copy Paste: [[2506.10737]] TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora(https://arxiv.org/abs/2506.10737)
Keywords: large language model
Abstract: The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines, as judged by LLMs.

Title: PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

Authors: SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10741
Pdf URL: https://arxiv.org/pdf/2506.10741
Copy Paste: [[2506.10741]] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework(https://arxiv.org/abs/2506.10741)
Keywords: robust
Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found in the Project page: this https URL

Title: ObfusBFA: A Holistic Approach to Safeguarding DNNs from Different Types of Bit-Flip Attacks

Authors: Xiaobei Yan, Han Qiu, Tianwei Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.10744
Pdf URL: https://arxiv.org/pdf/2506.10744
Copy Paste: [[2506.10744]] ObfusBFA: A Holistic Approach to Safeguarding DNNs from Different Types of Bit-Flip Attacks(https://arxiv.org/abs/2506.10744)
Keywords: protect, defense, attack
Abstract: Bit-flip attacks (BFAs) represent a serious threat to Deep Neural Networks (DNNs), where flipping a small number of bits in the model parameters or binary code can significantly degrade the model accuracy or mislead the model prediction in a desired way. Existing defenses exclusively focus on protecting models for specific attacks and platforms, while lacking effectiveness for other scenarios. We propose ObfusBFA, an efficient and holistic methodology to mitigate BFAs targeting both the high-level model weights and low-level codebase (executables or shared libraries). The key idea of ObfusBFA is to introduce random dummy operations during the model inference, which effectively transforms the delicate attacks into random bit flips, making it much harder for attackers to pinpoint and exploit vulnerable bits. We design novel algorithms to identify critical bits and insert obfuscation operations. We evaluate ObfusBFA against different types of attacks, including the adaptive scenarios where the attacker increases the flip bit budget to attempt to circumvent our defense. The results show that ObfusBFA can consistently preserve the model accuracy across various datasets and DNN architectures while significantly reducing the attack success rates. Additionally, it introduces minimal latency and storage overhead, making it a practical solution for real-world applications.

Title: One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Authors: Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10766
Pdf URL: https://arxiv.org/pdf/2506.10766
Copy Paste: [[2506.10766]] One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers(https://arxiv.org/abs/2506.10766)
Keywords: large language model
Abstract: Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

Title: Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs

Authors: Alberto Testoni, Iacer Calixto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10769
Pdf URL: https://arxiv.org/pdf/2506.10769
Copy Paste: [[2506.10769]] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs(https://arxiv.org/abs/2506.10769)
Keywords: large language model
Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.

Title: ME: Trigger Element Combination Backdoor Attack on Copyright Infringement

Authors: Feiyu Yang, Siyuan Liang, Aishan Liu, Dacheng Tao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10776
Pdf URL: https://arxiv.org/pdf/2506.10776
Copy Paste: [[2506.10776]] ME: Trigger Element Combination Backdoor Attack on Copyright Infringement(https://arxiv.org/abs/2506.10776)
Keywords: attack, steal, diffusion, generative
Abstract: The capability of generative diffusion models (DMs) like Stable Diffusion (SD) to replicate training data could be exploited by attackers to launch the Copyright Infringement Attack with duplicated poisoned image-text pairs. SilentBadDiffusion (SBD) is a recently proposed method that showed outstanding performance in attacking SD in text-to-image tasks. However, the feasible data resources in this area are still limited, and some of them are even constrained or prohibited due to issues such as copyright ownership or inappropriate content. Not all of the images in current datasets are suitable for the proposed attack methods. In addition, the state-of-the-art (SoTA) performance of SBD is far from ideal when only a few generated poisoning samples can be adopted for attacks. In this paper, we introduce new datasets accessible for research on attacks like SBD, and propose the Multi-Element (ME) attack method based on SBD, which increases the number of poisonous visual-text elements per poisoned sample to strengthen the attack, while applying the Discrete Cosine Transform (DCT) to the poisoned samples to maintain stealthiness. The Copyright Infringement Rate (CIR) / First Attack Epoch (FAE) we obtained on the two new datasets were 16.78% / 39.50 and 51.20% / 23.60, respectively, close to or even exceeding the results on the benchmark Pokemon and Midjourney datasets. Under a low subsampling ratio (5%, 6 poisoned samples), MESI and DCT achieved CIR / FAE of 0.23% / 84.00 and 12.73% / 65.50, both better than the original SBD, which failed to attack at all.

Title: SlotPi: Physics-informed Object-centric Reasoning Models

Authors: Jian Li, Wan Han, Ning Lin, Yu-Liang Zhan, Ruizhi Chengze, Haining Wang, Yi Zhang, Hongsheng Liu, Zidong Wang, Fan Yu, Hao Sun
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10778
Pdf URL: https://arxiv.org/pdf/2506.10778
Copy Paste: [[2506.10778]] SlotPi: Physics-informed Object-centric Reasoning Models(https://arxiv.org/abs/2506.10778)
Keywords: robust
Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model's strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model's capabilities. The model's robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.

Title: Improving Named Entity Transcription with Contextual LLM-based Revision

Authors: Viet Anh Trinh, Xinlu He, Jacob Whitehill
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10779
Pdf URL: https://arxiv.org/pdf/2506.10779
Copy Paste: [[2506.10779]] Improving Named Entity Transcription with Contextual LLM-based Revision(https://arxiv.org/abs/2506.10779)
Keywords: large language model
Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM's reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.

Title: Human-Robot Navigation using Event-based Cameras and Reinforcement Learning

Authors: Ignacio Bugueno-Cordova, Javier Ruiz-del-Solar, Rodrigo Verschae
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10790
Pdf URL: https://arxiv.org/pdf/2506.10790
Copy Paste: [[2506.10790]] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning(https://arxiv.org/abs/2506.10790)
Keywords: robust
Abstract: This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.

Title: Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints

Authors: Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10800
Pdf URL: https://arxiv.org/pdf/2506.10800
Copy Paste: [[2506.10800]] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints(https://arxiv.org/abs/2506.10800)
Keywords: large language model
Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previously updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at this https URL.
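
Illustrative sketch (not from the paper): the projection step described in the abstract can be pictured with plain vectors. The example below assumes each earlier update is a flat list of floats, builds an orthonormal basis of their span with Gram-Schmidt, and removes from a new update every component lying in that span. The function names, the vector representation, and the toy numbers are assumptions made for illustration; LangEdit's actual construction of the subspaces may differ.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_components(v: Vector, basis: List[Vector]) -> Vector:
    """Subtract from v its components along each orthonormal basis vector."""
    w = list(v)
    for b in basis:
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


def orthonormalize(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = remove_components(v, basis)
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


# Hypothetical earlier updates (e.g. edits made for other languages).
previous_updates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(previous_updates)

# A new update is projected onto the orthogonal complement of that span.
new_update = [0.5, -2.0, 3.0]
projected = remove_components(new_update, basis)
print(projected)
```

In this toy setting the projected update has zero inner product with every earlier update, which is the sense in which the new edit cannot interfere with directions that earlier edits used.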

Title: Dense Associative Memory with Epanechnikov Energy

Authors: Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10801
Pdf URL: https://arxiv.org/pdf/2506.10801
Copy Paste: [[2506.10801]] Dense Associative Memory with Epanechnikov Energy(https://arxiv.org/abs/2506.10801)
Keywords: generative
Abstract: We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional emergent local minima while preserving perfect pattern recovery -- a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.
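
Illustrative sketch (not from the paper): the contrast between the two energies can be seen in a few lines. The code below assumes the LSE energy uses a Gaussian term exp(-beta/2 * d^2) per stored pattern and that the LSR energy replaces it with the Epanechnikov-style term ReLU(1 - beta/2 * d^2), where d is the distance to the pattern. These exact scalings, the function names, and the toy memories are assumptions; the paper's definition may include additional constants.

```python
import math
from typing import List

Vector = List[float]


def sq_dist(x: Vector, m: Vector) -> float:
    """Squared Euclidean distance between a query and a stored pattern."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def lse_energy(x: Vector, memories: List[Vector], beta: float) -> float:
    """Log-sum-exponential energy: every memory contributes a positive term."""
    total = sum(math.exp(-0.5 * beta * sq_dist(x, m)) for m in memories)
    return -math.log(total) / beta


def lsr_energy(x: Vector, memories: List[Vector], beta: float) -> float:
    """Log-sum-ReLU energy (assumed form): far memories contribute exactly 0."""
    total = sum(max(0.0, 1.0 - 0.5 * beta * sq_dist(x, m)) for m in memories)
    # Outside the support of every kernel the energy is taken to be +infinity.
    return math.inf if total == 0.0 else -math.log(total) / beta


memories = [[0.0, 0.0], [4.0, 0.0]]
at_memory = [0.0, 0.0]
between = [2.0, 0.0]

print(lse_energy(at_memory, memories, beta=1.0))
print(lsr_energy(at_memory, memories, beta=1.0))
print(lsr_energy(between, memories, beta=1.0))
```

Under these assumptions the second memory adds a small but nonzero term to the LSE energy at the first memory, while its LSR term is exactly zero because the Epanechnikov kernel has compact support.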

Title: Detecting High-Stakes Interactions with Activation Probes

Authors: Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10805
Pdf URL: https://arxiv.org/pdf/2506.10805
Copy Paste: [[2506.10805]] Detecting High-Stakes Interactions with Activation Probes(https://arxiv.org/abs/2506.10805)
Keywords: robust, large language model
Abstract: Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting "high-stakes" interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.

Title: Prompts to Summaries: Zero-Shot Language-Guided Video Summarization

Authors: Mario Barbara, Alaa Maalouf
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10807
Pdf URL: https://arxiv.org/pdf/2506.10807
Copy Paste: [[2506.10807]] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization(https://arxiv.org/abs/2506.10807)
Keywords: robust, data-free, large language model
Abstract: The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts off-the-shelf video-language model (VidLM) captions into user-guided skims via large language model (LLM) judging, without the use of training data at all, beating all unsupervised and matching supervised methods. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data and the competing methods requiring supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.

Title: Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders

Authors: Hui Yang, Wei Sun, Jian Liu, Jin Zheng, Jian Xiao, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10816
Pdf URL: https://arxiv.org/pdf/2506.10816
Copy Paste: [[2506.10816]] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders(https://arxiv.org/abs/2506.10816)
Keywords: robust
Abstract: Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.

Title: VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10821
Pdf URL: https://arxiv.org/pdf/2506.10821
Copy Paste: [[2506.10821]] VideoDeepResearch: Long Video Understanding With Agentic Tool Using(https://arxiv.org/abs/2506.10821)
Keywords: large language model
Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.

Title: ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Authors: Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10822
Pdf URL: https://arxiv.org/pdf/2506.10822
Copy Paste: [[2506.10822]] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization(https://arxiv.org/abs/2506.10822)
Keywords: large language model
Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs), one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via this https URL.
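
Illustrative sketch (not from the paper): parameter interpolation between two models with identical architectures is commonly done as an element-wise weighted average. The example below assumes each model is a dictionary mapping parameter names to flat lists of floats and uses a single mixing weight alpha. The names, the dictionary layout, and the choice alpha = 0.5 are assumptions; the abstract does not state how ReCUT sets the interpolation weights.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_params(a: Params, b: Params, alpha: float) -> Params:
    """Return alpha * a + (1 - alpha) * b, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if a.keys() != b.keys():
        raise ValueError("models must share the same parameter names")
    merged: Params = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y
                        for x, y in zip(a[name], b[name])]
    return merged


# Hypothetical toy "models": one tuned for accuracy, one for short reasoning.
accuracy_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
short_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = interpolate_params(accuracy_model, short_model, alpha=0.5)
print(merged)
```

Sweeping alpha between 0 and 1 in such a scheme moves the merged model from the length-optimized parameters toward the accuracy-optimized ones.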

Title: Efficiency Robustness of Dynamic Deep Learning Systems

Authors: Ravishka Rathnasuriya, Tingxi Li, Zexin Xu, Zihe Song, Mirazul Haque, Simin Chen, Wei Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10831
Pdf URL: https://arxiv.org/pdf/2506.10831
Copy Paste: [[2506.10831]] Efficiency Robustness of Dynamic Deep Learning Systems(https://arxiv.org/abs/2506.10831)
Keywords: secure, defense, attack, robust
Abstract: Deep Learning Systems (DLSs) are increasingly deployed in real-time applications, including those in resource-constrained environments such as mobile and IoT devices. To address efficiency challenges, Dynamic Deep Learning Systems (DDLSs) adapt inference computation based on input complexity, reducing overhead. While this dynamic behavior improves efficiency, it also introduces new attack surfaces. In particular, efficiency adversarial attacks exploit these dynamic mechanisms to degrade system performance. This paper systematically explores efficiency robustness of DDLSs, presenting the first comprehensive taxonomy of efficiency attacks. We categorize these attacks based on three dynamic behaviors: (i) attacks on dynamic computations per inference, (ii) attacks on dynamic inference iterations, and (iii) attacks on dynamic output production for downstream tasks. Through an in-depth evaluation, we analyze adversarial strategies that target DDLS efficiency and identify key challenges in securing these systems. In addition, we investigate existing defense mechanisms, demonstrating their limitations against increasingly popular efficiency attacks and the necessity for novel mitigation strategies to secure future adaptive DDLSs.

Title: Advanced fraud detection using machine learning models: enhancing financial transaction security

Authors: Nudrat Fariha, Md Nazmuddin Moin Khan, Md Iqbal Hossain, Syed Ali Reza, Joy Chakra Bortty, Kazi Sharmin Sultana, Md Shadidur Islam Jawad, Saniah Safat, Md Abdul Ahad, Maksuda Begum
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10842
Pdf URL: https://arxiv.org/pdf/2506.10842
Copy Paste: [[2506.10842]] Advanced fraud detection using machine learning models: enhancing financial transaction security(https://arxiv.org/abs/2506.10842)
Keywords: security
Abstract: The rise of digital payments has accelerated the need for intelligent and scalable systems to detect fraud. This research presents an end-to-end, feature-rich machine learning framework for detecting credit card transaction anomalies and fraud using real-world data. The study begins by merging transactional, cardholder, merchant, and merchant category datasets from a relational database to create a unified analytical view. Through the feature engineering process, we extract behavioural signals such as average spending, deviation from historical patterns, transaction timing irregularities, and category frequency metrics. These features are enriched with temporal markers such as hour, day of week, and weekend indicators to expose all latent patterns that indicate fraudulent behaviours. Exploratory data analysis reveals contextual transaction trends across all the dataset features. Using the transactional data, we train and evaluate a range of unsupervised models: Isolation Forest, One Class SVM, and a deep autoencoder trained to reconstruct normal behavior. These models flag the top 1% of reconstruction errors as outliers. PCA visualizations illustrate each model's ability to separate anomalies in a two-dimensional latent space. We further segment the transaction landscape using K-Means clustering and DBSCAN to identify dense clusters of normal activity and isolate sparse, suspicious regions.
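
Illustrative sketch (not from the paper): flagging the top 1% of anomaly scores amounts to thresholding at the score of the k-th highest transaction. The example below assumes one non-negative score per transaction, where higher means more anomalous, and uses synthetic scores from a seeded random generator. The function name, the synthetic data, and the tie-handling rule are assumptions made for illustration.

```python
import random
from typing import List


def flag_top_fraction(scores: List[float], fraction: float = 0.01) -> List[bool]:
    """Mark the highest-scoring `fraction` of items as outliers.

    Ties at the threshold are all flagged, so slightly more than the
    requested fraction can be returned when scores repeat.
    """
    if not scores:
        return []
    k = max(1, int(len(scores) * fraction))          # number of items to flag
    threshold = sorted(scores, reverse=True)[k - 1]  # k-th largest score
    return [s >= threshold for s in scores]


# Synthetic stand-in for per-transaction reconstruction errors.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) ** 2 for _ in range(1000)]

flags = flag_top_fraction(scores, fraction=0.01)
flagged = [s for s, f in zip(scores, flags) if f]
unflagged = [s for s, f in zip(scores, flags) if not f]
print(len(flagged), min(flagged), max(unflagged))
```

A fixed-fraction rule like this always flags roughly the same share of transactions, so it expresses a review budget rather than an estimate of the true fraud rate.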

Title: Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles

Authors: Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10848
Pdf URL: https://arxiv.org/pdf/2506.10848
Copy Paste: [[2506.10848]] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles(https://arxiv.org/abs/2506.10848)
Keywords: diffusion, large language model
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

Title: Viability of Future Actions: Robust Safety in Reinforcement Learning via Entropy Regularization

Authors: Pierre-François Massiani, Alexander von Rohr, Lukas Haverbeck, Sebastian Trimpe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10871
Pdf URL: https://arxiv.org/pdf/2506.10871
Copy Paste: [[2506.10871]] Viability of Future Actions: Robust Safety in Reinforcement Learning via Entropy Regularization(https://arxiv.org/abs/2506.10871)
Keywords: robust
Abstract: Despite the many recent advances in reinforcement learning (RL), the question of learning policies that robustly satisfy state constraints under unknown disturbances remains open. In this paper, we offer a new perspective on achieving robust safety by analyzing the interplay between two well-established techniques in model-free RL: entropy regularization, and constraints penalization. We reveal empirically that entropy regularization in constrained RL inherently biases learning toward maximizing the number of future viable actions, thereby promoting constraints satisfaction robust to action noise. Furthermore, we show that by relaxing strict safety constraints through penalties, the constrained RL problem can be approximated arbitrarily closely by an unconstrained one and thus solved using standard model-free RL. This reformulation preserves both safety and optimality while empirically improving resilience to disturbances. Our results indicate that the connection between entropy regularization and robustness is a promising avenue for further empirical and theoretical investigation, as it enables robust safety in RL through simple reward shaping.

Title: Slimming Down LLMs Without Losing Their Minds

Authors: Qingda (Michael) Mai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10885
Pdf URL: https://arxiv.org/pdf/2506.10885
Copy Paste: [[2506.10885]] Slimming Down LLMs Without Losing Their Minds(https://arxiv.org/abs/2506.10885)
Keywords: large language model
Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.

Title: Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10887
Pdf URL: https://arxiv.org/pdf/2506.10887
Copy Paste: [[2506.10887]] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers(https://arxiv.org/abs/2506.10887)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We prove that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

Title: Lattice Climber Attack: Adversarial attacks for randomized mixtures of classifiers

Authors: Lucas Gnecco-Heredia, Benjamin Negrevergne, Yann Chevaleyre
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10888
Pdf URL: https://arxiv.org/pdf/2506.10888
Copy Paste: [[2506.10888]] Lattice Climber Attack: Adversarial attacks for randomized mixtures of classifiers(https://arxiv.org/abs/2506.10888)
Keywords: attack, robust
Abstract: Finite mixtures of classifiers (a.k.a. randomized ensembles) have been proposed as a way to improve robustness against adversarial attacks. However, existing attacks have been shown to not suit this kind of classifier. In this paper, we discuss the problem of attacking a mixture in a principled way and introduce two desirable properties of attacks based on a geometrical analysis of the problem (effectiveness and maximality). We then show that existing attacks do not meet both of these properties. Finally, we introduce a new attack called lattice climber attack with theoretical guarantees in the binary linear setting, and demonstrate its performance by conducting experiments on synthetic and real datasets.

Title: The Diffusion Duality

Authors: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10892
Pdf URL: https://arxiv.org/pdf/2506.10892
Copy Paste: [[2506.10892]] The Diffusion Duality(https://arxiv.org/abs/2506.10892)
Keywords: diffusion
Abstract: Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: this http URL

Title: AIR: Zero-shot Generative Model Adaptation with Iterative Refinement

Authors: Guimeng Liu, Milad Abdollahzadeh, Ngai-Man Cheung
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10895
Pdf URL: https://arxiv.org/pdf/2506.10895
Copy Paste: [[2506.10895]] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement(https://arxiv.org/abs/2506.10895)
Keywords: generative
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches are directional losses, which use the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets between the two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user studies in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are in Supp.

Title: BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Authors: Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10896
Pdf URL: https://arxiv.org/pdf/2506.10896
Copy Paste: [[2506.10896]] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP(https://arxiv.org/abs/2506.10896)
Keywords: transformer
Abstract: Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.

Title: Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Authors: Lan Zhang, Marco Valentino, Andre Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10903
Pdf URL: https://arxiv.org/pdf/2506.10903
Copy Paste: [[2506.10903]] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning(https://arxiv.org/abs/2506.10903)
Keywords: large language model
Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.

Title: NoLoCo: No-all-reduce Low Communication Training Method for Large Models

Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10911
Pdf URL: https://arxiv.org/pdf/2506.10911
Copy Paste: [[2506.10911]] NoLoCo: No-all-reduce Low Communication Training Method for Large Models(https://arxiv.org/abs/2506.10911)
Keywords: large language model
Abstract: Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer by partially averaging each replica's weights with those of a randomly selected other replica. We provide both a theoretical convergence analysis for our proposed optimizer and empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, from 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low communication training method DiLoCo. The synchronization step itself is estimated to be one order of magnitude faster than the all-reduce used in DiLoCo for a few hundred accelerators training over the internet. We also do not have any global blocking communication, which reduces accelerator idling time. Compared to DiLoCo, we also observe up to $4\%$ faster convergence across a wide range of model sizes and accelerator counts.
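
The idea of partially averaging weights with a randomly selected peer can be pictured with a toy gossip-averaging step. This is only a sketch under strong assumptions: it omits the Nesterov momentum variant the abstract describes, and `gossip_step` and the mixing factor `alpha` are hypothetical names introduced here.

```python
import numpy as np

def gossip_step(weights, rng, alpha=0.5):
    """Pair replicas at random; each pair moves part of the way to its pairwise mean.

    weights: array of shape (n_replicas, n_params). With an odd number of
    replicas, the unpaired replica is left unchanged in that step.
    """
    perm = rng.permutation(len(weights))
    new = weights.copy()
    for i, j in zip(perm[0::2], perm[1::2]):
        pair_mean = 0.5 * (weights[i] + weights[j])
        new[i] = (1 - alpha) * weights[i] + alpha * pair_mean
        new[j] = (1 - alpha) * weights[j] + alpha * pair_mean
    return new

rng = np.random.default_rng(0)
initial = rng.normal(size=(8, 4))   # 8 toy replicas, 4 parameters each
current = initial
for _ in range(20):
    current = gossip_step(current, rng)

print("initial spread:", initial.std(axis=0).max())
print("final spread:  ", current.std(axis=0).max())
```

Each step touches only pairs of replicas, so no collective operation over all replicas is needed, yet the replica average is preserved and the spread between replicas shrinks over repeated steps.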

Title: Foundation Models for Causal Inference via Prior-Data Fitted Networks

Authors: Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10914
Pdf URL: https://arxiv.org/pdf/2506.10914
Copy Paste: [[2506.10914]] Foundation Models for Causal Inference via Prior-Data Fitted Networks(https://arxiv.org/abs/2506.10914)
Keywords: transformer
Abstract: Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train a foundation model for estimating conditional average treatment effects (CATEs) using back-door adjustment. We show that CausalFM performs competitively for CATE estimation using various synthetic and semi-synthetic benchmarks. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.

Title: M4V: Multi-Modal Mamba for Text-to-Video Generation

Authors: Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10915
Pdf URL: https://arxiv.org/pdf/2506.10915
Copy Paste: [[2506.10915]] M4V: Multi-Modal Mamba for Text-to-Video Generation(https://arxiv.org/abs/2506.10915)
Keywords: diffusion, transformer
Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at this https URL.

Title: Sequential-Parallel Duality in Prefix Scannable Models

Authors: Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, Jacob Andreas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10918
Pdf URL: https://arxiv.org/pdf/2506.10918
Copy Paste: [[2506.10918]] Sequential-Parallel Duality in Prefix Scannable Models(https://arxiv.org/abs/2506.10918)
Keywords: transformer
Abstract: Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such ``sequential-parallel duality.'' This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models -- state space models -- as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models -- in some cases exhibiting better length generalization than either.
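
To make the prefix-scan view concrete, here is a minimal sketch for the scalar linear recurrence h_t = a_t * h_{t-1} + b_t, a simple stand-in for a state space model update. The recurrence, the pair encoding (a_t, b_t), and the function names are assumptions chosen for illustration, not taken from the paper.

```python
def combine(left, right):
    """Associative operator: compose two affine maps h -> a*h + b, left first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential(a, b):
    """Reference implementation: one state update per token."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def prefix_scan(elems, op):
    """Inclusive Hillis-Steele scan; every round could run in parallel over i."""
    elems = list(elems)
    step = 1
    while step < len(elems):
        elems = [
            op(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    return elems

a = [0.5, 0.9, 1.1, 0.3, 0.7]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
scanned = [h for _, h in prefix_scan(zip(a, b), combine)]
reference = sequential(a, b)
print(scanned)
print(reference)
```

The scan needs about log2(N) rounds instead of N sequential steps, and it relies on `combine` being associative. That associativity requirement is what the abstract's more general class relaxes.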

Title: Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Authors: Or Shafran, Atticus Geiger, Mor Geva
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10920
Pdf URL: https://arxiv.org/pdf/2506.10920
Copy Paste: [[2506.10920]] Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization(https://arxiv.org/abs/2506.10920)
Keywords: interpretability, large language model
Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.

Title: Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Authors: Adam Karvonen, Samuel Marks
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10922
Pdf URL: https://arxiv.org/pdf/2506.10922
Copy Paste: [[2506.10922]] Robustly Improving LLM Fairness in Realistic Settings via Interpretability(https://arxiv.org/abs/2506.10922)
Keywords: robust, fair, interpretability, large language model
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., ``only accept candidates in the top 10\%'') induces significant racial and gender biases (up to 12\% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model's chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1\%, always below 2.5\%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
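
The abstract names affine concept editing but gives no formula. One common reading, shown below as an assumption rather than the authors' implementation, is to overwrite the component of each activation along a chosen direction with a fixed target value. The random activations, the direction, and the target 0.3 are all hypothetical.

```python
import numpy as np

def affine_edit(x, direction, target):
    """Set the component of x along `direction` to `target`; leave the rest unchanged."""
    d = direction / np.linalg.norm(direction)
    coeff = np.expand_dims(x @ d, -1)     # current component along d
    return x - coeff * d + target * d

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))           # hypothetical activations: 5 tokens, 16 dims
direction = rng.normal(size=16)           # hypothetical attribute direction
unit = direction / np.linalg.norm(direction)

edited = affine_edit(acts, direction, target=0.3)
print(edited @ unit)                      # component along the direction after editing
```

After the edit every activation has the same component along the direction, so a downstream readout along that direction can no longer separate the inputs, while the orthogonal part of each activation is untouched.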

Title: Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction

Authors: Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10930
Pdf URL: https://arxiv.org/pdf/2506.10930
Copy Paste: [[2506.10930]] Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction(https://arxiv.org/abs/2506.10930)
Keywords: secure
Abstract: Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including ``Other'' (O) and ``No Agreement'' (X) samples in the training set. Our system's results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble.

Title: Dynamic Epistemic Friction in Dialogue

Authors: Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10934
Pdf URL: https://arxiv.org/pdf/2506.10934
Copy Paste: [[2506.10934]] Dynamic Epistemic Friction in Dialogue(https://arxiv.org/abs/2506.10934)
Keywords: large language model
Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of "epistemic friction," or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent's current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.

Title: VINCIE: Unlocking In-context Image Editing from Video

Authors: Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.10941
Pdf URL: https://arxiv.org/pdf/2506.10941
Copy Paste: [[2506.10941]] VINCIE: Unlocking In-context Image Editing from Video(https://arxiv.org/abs/2506.10941)
Keywords: diffusion, transformer, segmentation
Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.

Title: Self-Adapting Language Models

Authors: Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10943
Pdf URL: https://arxiv.org/pdf/2506.10943
Copy Paste: [[2506.10943]] Self-Adapting Language Models(https://arxiv.org/abs/2506.10943)
Keywords: large language model
Abstract: Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit, a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gradient-based updates. Through supervised finetuning (SFT), these self-edits result in persistent weight updates, enabling lasting adaptation. To train the model to produce effective self-edits, we use a reinforcement learning loop with the downstream performance of the updated model as the reward signal. Unlike prior approaches that rely on separate adaptation modules or auxiliary networks, SEAL directly uses the model's own generation to control its adaptation process. Experiments on knowledge incorporation and few-shot generalization show that SEAL is a promising step toward language models capable of self-directed adaptation. Our website and code are available at this https URL.

Title: GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models

Authors: Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang, Olgica Milenkovic, S. Rasoul Etesami
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10946
Pdf URL: https://arxiv.org/pdf/2506.10946
Copy Paste: [[2506.10946]] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models(https://arxiv.org/abs/2506.10946)
Keywords: privacy, protect, large language model
Abstract: Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the "alignment" between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.

Title: Execution Guided Line-by-Line Code Generation

Authors: Boaz Lavon, Shahar Katz, Lior Wolf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10948
Pdf URL: https://arxiv.org/pdf/2506.10948
Copy Paste: [[2506.10948]] Execution Guided Line-by-Line Code Generation(https://arxiv.org/abs/2506.10948)
Keywords: large language model
Abstract: We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. Our code is available at: this https URL
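
The method name refers to classifier-free guidance (CFG), though the abstract does not give the exact formula. The sketch below applies the standard CFG interpolation to next-token log-probabilities, treating the prompt with execution signals as the conditional branch. That mapping, the toy logits, and the guidance scale `gamma` are assumptions made here for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def cfg_logprobs(logits_plain, logits_with_signal, gamma):
    """Standard CFG mix: plain + gamma * (with_signal - plain), then renormalize."""
    lp_plain = log_softmax(logits_plain)
    lp_signal = log_softmax(logits_with_signal)
    return log_softmax(lp_plain + gamma * (lp_signal - lp_plain))

# Toy logits over a 4-token vocabulary (hypothetical values).
plain = np.array([2.0, 1.0, 0.5, -1.0])
with_signal = np.array([0.5, 2.5, 0.0, -1.0])

for gamma in (0.0, 1.0, 3.0):
    print(gamma, np.exp(cfg_logprobs(plain, with_signal, gamma)).round(3))
```

With `gamma = 0` the signal is ignored, with `gamma = 1` the signal-conditioned distribution is recovered, and larger values push the distribution further toward tokens that the execution signal favors.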

Title: Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Authors: Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10949
Pdf URL: https://arxiv.org/pdf/2506.10949
Copy Paste: [[2506.10949]] Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors(https://arxiv.org/abs/2506.10949)
Keywords: defense, attack, robust
Abstract: Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intents. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt-engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.

Title: Build the web for agents, not agents for the web

Authors: Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10953
Pdf URL: https://arxiv.org/pdf/2506.10953
Copy Paste: [[2506.10953]] Build the web for agents, not agents for the web(https://arxiv.org/abs/2506.10953)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents -- AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.

Title: ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems

Authors: Aayush Karan, Kulin Shah, Sitan Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10955
Pdf URL: https://arxiv.org/pdf/2506.10955
Copy Paste: [[2506.10955]] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems(https://arxiv.org/abs/2506.10955)
Keywords: diffusion
Abstract: There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user's choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.

Title: Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

Authors: Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao
Subjects: cs.LG, cs.AI, math.ST
Abstract URL: https://arxiv.org/abs/2506.10959
Pdf URL: https://arxiv.org/pdf/2506.10959
Copy Paste: [[2506.10959]] Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods(https://arxiv.org/abs/2506.10959)
Keywords: transformer
Abstract: While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding--particularly in the context of structured geometric data--remains unexplored. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. By establishing a novel connection between the attention mechanism and classical kernel methods, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.

Title: ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Authors: Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Subjects: cs.CL, cs.AI, cs.CR, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10960
Pdf URL: https://arxiv.org/pdf/2506.10960
Copy Paste: [[2506.10960]] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark(https://arxiv.org/abs/2506.10960)
Keywords: large language model
Abstract: Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at this https URL.

Title: SpectralAR: Spectral Autoregressive Visual Generation

Authors: Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10962
Pdf URL: https://arxiv.org/pdf/2506.10962
Copy Paste: [[2506.10962]] SpectralAR: Spectral Autoregressive Visual Generation(https://arxiv.org/abs/2506.10962)
Keywords: diffusion
Abstract: Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: this https URL.

Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10963
Pdf URL: https://arxiv.org/pdf/2506.10963
Copy Paste: [[2506.10963]] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning(https://arxiv.org/abs/2506.10963)
Keywords: diffusion
Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.

Title: Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Authors: Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10967
Pdf URL: https://arxiv.org/pdf/2506.10967
Copy Paste: [[2506.10967]] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs(https://arxiv.org/abs/2506.10967)
Keywords: large language model
Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at this https URL.
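
To make the DPP idea in this abstract concrete, here is a minimal sketch of greedy determinant maximization over a relevance-weighted similarity kernel. The abstract does not give CDPruner's exact kernel or inference procedure, so everything here is an assumption: the kernel form `L[i][j] = r_i * r_j * cos(f_i, f_j)`, the naive greedy search, and the names `build_kernel` and `greedy_dpp` are hypothetical.

```python
import math

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def build_kernel(features, relevance):
    """Assumed kernel: similarity scaled by per-token instruction relevance."""
    n = len(features)
    return [[relevance[i] * relevance[j] * cosine(features[i], features[j])
             for j in range(n)] for i in range(n)]

def greedy_dpp(kernel, k):
    """Greedily add the token that maximizes the determinant of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            sub = selected + [i]
            d = det([[kernel[a][b] for b in sub] for a in sub])
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected

# Toy tokens (assumed): tokens 0 and 1 are exact duplicates, token 2 is distinct.
features = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
relevance = [1.0, 1.0, 1.0]
kept = greedy_dpp(build_kernel(features, relevance), k=2)
```

Because a submatrix containing two identical tokens has zero determinant, the greedy step skips the duplicate and keeps tokens 0 and 2. This is the behaviour that separates a diversity objective from attention-only ranking.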

Title: Farseer: A Refined Scaling Law in Large Language Models

Authors: Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10972
Pdf URL: https://arxiv.org/pdf/2506.10972
Copy Paste: [[2506.10972]] Farseer: A Refined Scaling Law in Large Language Models(https://arxiv.org/abs/2506.10972)
Keywords: robust, large language model
Abstract: Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla's law by reducing extrapolation error by 433%. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. We are comprehensively open-sourcing all models, data, results, and logs at this https URL to foster further research.
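
For readers unfamiliar with what a loss surface $L(N,D)$ looks like, the sketch below evaluates the baseline Chinchilla-style form, L(N, D) = E + A / N^alpha + B / D^beta, which the abstract names as the prior law. It does not reproduce Farseer's own functional form, which the abstract does not state. The coefficient values are illustrative placeholders chosen for this example, not fitted values from any paper.

```python
def chinchilla_loss(n_params, n_tokens, e, a, alpha, b, beta):
    """Chinchilla-style parametric loss: irreducible term plus two power-law terms."""
    return e + a / n_params ** alpha + b / n_tokens ** beta

# Placeholder coefficients (assumed, for illustration only).
coeffs = dict(e=1.7, a=400.0, alpha=0.34, b=410.0, beta=0.28)

# Predicted loss at a small and a large (N, D) setting.
small = chinchilla_loss(1e8, 2e9, **coeffs)
large = chinchilla_loss(1e10, 2e11, **coeffs)
```

Under this form, loss falls monotonically as either model size N or token count D grows, and it approaches the irreducible term `e` in the limit. Extrapolation error is the gap between such a prediction and the loss actually measured at a larger scale.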

Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Authors: Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.HC, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2506.10974
Pdf URL: https://arxiv.org/pdf/2506.10974
Copy Paste: [[2506.10974]] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science(https://arxiv.org/abs/2506.10974)
Keywords: robust, large language model
Abstract: Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

Title: QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction

Authors: Sicheng Zuo, Wenzhao Zheng, Xiaoyong Han, Longchao Yang, Yong Pan, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10977
Pdf URL: https://arxiv.org/pdf/2506.10977
Copy Paste: [[2506.10977]] QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction(https://arxiv.org/abs/2506.10977)
Keywords: robust
Abstract: 3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal shape prior limits the modeling of diverse structures. In real-world driving scenes, objects exhibit rich geometries (e.g., cuboids, cylinders, and irregular shapes), necessitating excessive ellipsoidal Gaussians densely packed for accurate modeling, which leads to inefficient representations. To address this, we propose to use geometrically expressive superquadrics as scene primitives, enabling efficient representation of complex structures with fewer primitives through their inherent shape diversity. We develop a probabilistic superquadric mixture model, which interprets each superquadric as an occupancy probability distribution with a corresponding geometry prior, and calculates semantics through probabilistic mixture. Building on this, we present QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction, and introduce a pruning-and-splitting module to further enhance modeling efficiency by concentrating superquadrics in occupied regions. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while maintaining superior efficiency.

Title: Fine-Grained Perturbation Guidance via Attention Head Selection

Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10978
Pdf URL: https://arxiv.org/pdf/2506.10978
Copy Paste: [[2506.10978]] Fine-Grained Perturbation Guidance via Attention Head Selection(https://arxiv.org/abs/2506.10978)
Keywords: diffusion, transformer
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
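
The SoftPAG operation described in this abstract, linear interpolation of an attention map toward the identity matrix, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `soft_identity_attention` and the parameter name `strength` are hypothetical, and the sketch operates on a single head's row-stochastic attention map given as nested lists rather than on a real DiT model.

```python
def soft_identity_attention(attn, strength):
    """Blend a square attention map with the identity matrix.

    strength = 0.0 leaves the map unchanged; strength = 1.0 replaces it
    with the identity, so each token attends only to itself.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    n = len(attn)
    return [
        [(1.0 - strength) * attn[i][j] + (strength if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

# Toy 2-token attention map (assumed); each row sums to 1.
attn = [[0.5, 0.5],
        [0.2, 0.8]]
half = soft_identity_attention(attn, 0.5)
```

Since both the original map and the identity have rows summing to 1, every interpolated map remains a valid attention distribution, which is what lets `strength` act as a continuous knob.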

Title: SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

Authors: Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10981
Pdf URL: https://arxiv.org/pdf/2506.10981
Copy Paste: [[2506.10981]] SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis(https://arxiv.org/abs/2506.10981)
Keywords: diffusion, generative
Abstract: Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: this https URL

Title: Rethinking Losses for Diffusion Bridge Samplers

Authors: Sebastian Sanokowski, Lukas Gruber, Christoph Bartmann, Sepp Hochreiter, Sebastian Lehner
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.10982
Pdf URL: https://arxiv.org/pdf/2506.10982
Copy Paste: [[2506.10982]] Rethinking Losses for Diffusion Bridge Samplers(https://arxiv.org/abs/2506.10982)
Keywords: diffusion
Abstract: Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions. Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients. While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned. Based on this insight, we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality. Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) not only avoids these conceptual problems but also consistently outperforms the LV loss. Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. From a practical perspective, we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior.