2025-06-30

Title: VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Authors: Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21556
Pdf URL: https://arxiv.org/pdf/2506.21556
Copy Paste: [[2506.21556]] VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation(https://arxiv.org/abs/2506.21556)
Keywords: large language model
Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.

Title: Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning

Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21557
Pdf URL: https://arxiv.org/pdf/2506.21557
Copy Paste: [[2506.21557]] Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning(https://arxiv.org/abs/2506.21557)
Keywords: interpretability, diffusion, generative, large language model
Abstract: The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk-and-Infer framework for Fake News Detection(DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models (MLLMs). Specifically, debunk diffusion is employed to generate refuting or authenticating evidence based on the multimodal content of news videos, enriching the evaluation process with diverse yet semantically aligned synthetic samples. To improve inference, we propose a chain-of-debunk strategy where a multi-agent MLLM system produces logic-grounded, multimodal-aware reasoning content and final veracity judgment. By jointly modeling multimodal features, generative debunking cues, and reasoning-rich verification within a unified architecture, DIFND achieves notable improvements in detection accuracy. Extensive experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.

Title: GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations

Authors: Junze Chen, Cheng Yang, Shujie Li, Zhiqiang Zhang, Yawen Li, Junping Du, Chuan Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21559
Pdf URL: https://arxiv.org/pdf/2506.21559
Copy Paste: [[2506.21559]] GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations(https://arxiv.org/abs/2506.21559)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated their strong capabilities in various domains, and have been recently integrated for graph analysis as graph language models (GLMs). With LLMs as the predictor, some GLMs can interpret unseen tasks described by natural language, and learn from a few examples in the prompts without parameter tuning, known as in-context learning (ICL). Another subset of GLMs utilizes abundant training labels to enhance model performance, known as instruction tuning. However, we argue that ICL on graphs has effectiveness issues due to fixed parameters and efficiency issues due to long context. Meanwhile, the large amount of labeled data required for instruction tuning can be difficult to obtain in real-world scenarios. To this end, we aim to introduce an extra parameter adaptation stage that can efficiently tailor GLMs to an unseen graph and task with only a few labeled examples, in exchange for better prediction accuracy and faster inference speed. For implementation, in this paper we propose GraphLAMA method, with its model backbone and learning schemes specialized for efficient tuning and inference. Specifically, for model backbone, we use a graph neural network (GNN) with several well-designed components to transform nodes into the representation space of LLM tokens. Task instructions can then be represented as a mixture of node and language tokens. In the pre-training stage, model parameters except the LLM will be trained with different tasks to capture general knowledge. In the adaptation stage, only a few pre-trained parameters will be updated based on few-shot examples. Extensive experiments on few/zero-shot node classification and summary generation show that our proposed GraphLAMA achieves state-of-the-art performance with 4.91% absolution improvement in accuracy. Compared with ICL, our inference speed can be 10 times faster under 5-shot setting.

Title: Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs

Authors: Emilio Barkett, Olivia Long, Madhavendra Thakur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21561
Pdf URL: https://arxiv.org/pdf/2506.21561
Copy Paste: [[2506.21561]] Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs(https://arxiv.org/abs/2506.21561)
Keywords: large language model
Abstract: Despite their widespread use in fact-checking, moderation, and high-stakes decision-making, large language models (LLMs) remain poorly understood as judges of truth. This study presents the largest evaluation to date of LLMs' veracity detection capabilities and the first analysis of these capabilities in reasoning models. We had eight LLMs make 4,800 veracity judgments across several prompts, comparing reasoning and non-reasoning models. We find that rates of truth-bias, or the likelihood to believe a statement is true, regardless of whether it is actually true, are lower in reasoning models than in non-reasoning models, but still higher than human benchmarks. Most concerning, we identify sycophantic tendencies in several advanced models (o4-mini and GPT-4.1 from OpenAI, R1 from DeepSeek), which displayed an asymmetry in detection accuracy, performing well in truth accuracy but poorly in deception accuracy. This suggests that capability advances alone do not resolve fundamental veracity detection challenges in LLMs.

Title: FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction

Authors: Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, Shuai Lu
Subjects: cs.CL, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2506.21562
Pdf URL: https://arxiv.org/pdf/2506.21562
Copy Paste: [[2506.21562]] FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction(https://arxiv.org/abs/2506.21562)
Keywords: diffusion, generative, large language model
Abstract: In the architectural design process, floor plan generation is inherently progressive and iterative. However, existing generative models for floor plans are predominantly end-to-end generation that produce an entire pixel-based layout in a single pass. This paradigm is often incompatible with the incremental workflows observed in real-world architectural practice. To address this issue, we draw inspiration from the autoregressive 'next token prediction' mechanism commonly used in large language models, and propose a novel 'next room prediction' paradigm tailored to architectural floor plan modeling. Experimental evaluation indicates that FPDS demonstrates competitive performance in comparison to diffusion models and Tell2Design in the text-to-floorplan task, indicating its potential applicability in supporting future intelligent architectural design.

Title: FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models

Authors: Kaiying Kevin Lin, Hsiyu Chen, Haopeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21563
Pdf URL: https://arxiv.org/pdf/2506.21563
Copy Paste: [[2506.21563]] FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models(https://arxiv.org/abs/2506.21563)
Keywords: large language model
Abstract: While large language models (LLMs) have demonstrated impressive performance across a wide range of natural language processing (NLP) tasks in high-resource languages, their capabilities in low-resource and minority languages remain significantly underexplored. Formosan languages -- a subgroup of Austronesian languages spoken in Taiwan -- are both linguistically rich and endangered, largely due to the sociolinguistic dominance of Mandarin. In this work, we introduce FORMOSANBENCH, the first benchmark for evaluating LLMs on low-resource Austronesian languages. It covers three endangered Formosan languages: Atayal, Amis, and Paiwan, across three core NLP tasks: machine translation, automatic speech recognition (ASR), and text summarization. We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH. Our results reveal a substantial performance gap between high-resource and Formosan languages. Existing LLMs consistently underperform across all tasks, with 10-shot learning and fine-tuning offering only limited improvements. These findings underscore the urgent need for more inclusive NLP technologies that can effectively support endangered and underrepresented languages. We release our datasets and code to facilitate future research in this direction.

Title: Team QUST at SemEval-2025 Task 10: Evaluating Large Language Models in Multiclass Multi-label Classification of News Entity Framing

Authors: Jiyan Liu, Youzheng Liu, Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21564
Pdf URL: https://arxiv.org/pdf/2506.21564
Copy Paste: [[2506.21564]] Team QUST at SemEval-2025 Task 10: Evaluating Large Language Models in Multiclass Multi-label Classification of News Entity Framing(https://arxiv.org/abs/2506.21564)
Keywords: large language model
Abstract: This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: this https URL.

Title: A Multi-Agent Probabilistic Inference Framework Inspired by Kairanban-Style CoT System with IdoBata Conversation for Debiasing

Authors: Takato Ueno, Keito Inoshita
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.21565
Pdf URL: https://arxiv.org/pdf/2506.21565
Copy Paste: [[2506.21565]] A Multi-Agent Probabilistic Inference Framework Inspired by Kairanban-Style CoT System with IdoBata Conversation for Debiasing(https://arxiv.org/abs/2506.21565)
Keywords: explainability, large language model
Abstract: Japan's kairanban culture and idobata conversations have long functioned as traditional communication practices that foster nuanced dialogue among community members and contribute to the formation of social balance. Inspired by these information exchange processes, this study proposes a multi-agent inference framework (KCS+IBC) that integrates multiple large language models (LLMs) to achieve bias mitigation, improved explainability, and probabilistic prediction in sentiment analysis. In addition to sequentially sharing prediction results, the proposed method incorporates a mid-phase casual dialogue session to blend formal inference with individual perspectives and introduces probabilistic sentiment prediction. Experimental results show that KCS achieves accuracy comparable to that of a single LLM across datasets, while KCS+IBC exhibits a consistent decrease in entropy and a gradual increase in variance during the latter stages of inference, suggesting the framework's ability to balance aggregation and diversity of predictions. Future work will quantitatively assess the impact of these characteristics on bias correction and aim to develop more advanced sentiment analysis systems.

Title: BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining

Authors: Baqer M. Merzah, Tania Taami, Salman Asoudeh, Amir reza Hossein pour, Saeed Mirzaee, Amir Ali Bengari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21567
Pdf URL: https://arxiv.org/pdf/2506.21567
Copy Paste: [[2506.21567]] BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining(https://arxiv.org/abs/2506.21567)
Keywords: large language model
Abstract: Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. BioParsQA was also introduced to evaluate the proposed model, which consists of 5,231 Persian medical questions and answers. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLM in bioinformatics tasks. To our knowledge, BioPars is the first application of LLM in Persian medical QA, especially for generating long answers. Evaluation of four selected medical QA datasets shows that BioPars has achieved remarkable results compared to comparative approaches. The model on BioParsQA achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. The MoverScore and BLEURT values were also higher in this model than the other three models. In addition, the reported scores for the model are MoverScore=60.43 and BLEURT=50.78. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: this https URL.

Title: Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLMs for Personal Assistants Integretion

Authors: Andrejs Sorstkins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21568
Pdf URL: https://arxiv.org/pdf/2506.21568
Copy Paste: [[2506.21568]] Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLMs for Personal Assistants Integretion(https://arxiv.org/abs/2506.21568)
Keywords: privacy, large language model
Abstract: Resource efficiency is a critical barrier to deploying large language models (LLMs) in edge and privacy-sensitive applications. This study evaluates the efficacy of two augmentation strategies--Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE)--on compact Gemma LLMs of 1 billion and 4 billion parameters, within the context of a privacy-first personal assistant. We implement short-term memory via MongoDB and long-term semantic storage via Qdrant, orchestrated through FastAPI and LangChain, and expose the system through a this http URL frontend. Across both model scales, RAG consistently reduces latency by up to 17\% and eliminates factual hallucinations when responding to user-specific and domain-specific queries. HyDE, by contrast, enhances semantic relevance--particularly for complex physics prompts--but incurs a 25--40\% increase in response time and a non-negligible hallucination rate in personal-data retrieval. Comparing 1 B to 4 B models, we observe that scaling yields marginal throughput gains for baseline and RAG pipelines, but magnifies HyDE's computational overhead and variability. Our findings position RAG as the pragmatic choice for on-device personal assistants powered by small-scale LLMs.

Title: Hybrid-NL2SVA: Integrating RAG and Finetuning for LLM-based NL2SVA

Authors: Weihua Xiao, Derek Ekberg, Siddharth Garg, Ramesh Karri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21569
Pdf URL: https://arxiv.org/pdf/2506.21569
Copy Paste: [[2506.21569]] Hybrid-NL2SVA: Integrating RAG and Finetuning for LLM-based NL2SVA(https://arxiv.org/abs/2506.21569)
Keywords: large language model
Abstract: SystemVerilog Assertions (SVAs) are critical for verifying the correctness of hardware designs, but manually writing them from natural language property descriptions, i.e., NL2SVA, remains a labor-intensive and error-prone task. Recent advances in large language models (LLMs) offer opportunities to automate this translation. However, existing models still struggle with understanding domain-specific syntax and semantics. To enhance LLM performance in NL2SVA, we propose a customized retrieval-augmented generation (RAG) framework and a synthetic fine-tuning dataset that together improve LLM's performance. To further improve lightweight models over NL2SVA, our fine-tuning dataset provides prompt-guided explanations that teach LLMs the layer-by-layer construction process of concurrent SVAs, enabling supervised fine-tuning that greatly improves syntax and functionality accuracy. To evaluate the performance of LLMs over NL2SVA, we construct the largest evaluation dataset for NL2SVA, comprising 40 Verilog designs and 229 formally verified SVAs with detailed annotations. Experimental results show that our customized RAG framework increases the number of functionality matched SVAs by 58.42% over GPT-4o-mini, while Qwen2.5-Coder-7B-Instruct fine-tuned on our fine-tuning dataset and integrated with HybridRetrieval achieves a 59.05% over the base Qwen model.

Title: Towards Understanding the Cognitive Habits of Large Reasoning Models

Authors: Jianshuo Dong, Yujia Fu, Chuanrui Hu, Chao Zhang, Han Qiu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.21571
Pdf URL: https://arxiv.org/pdf/2506.21571
Copy Paste: [[2506.21571]] Towards Understanding the Cognitive Habits of Large Reasoning Models(https://arxiv.org/abs/2506.21571)
Keywords: extraction
Abstract: Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns -- e.g., ``Wait, did I miss anything?'' -- consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs' cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs' cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs' CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: this https URL.

Title: Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

Authors: Tianyu.Zou, Shengwu.Xiong, Ruilin.Yao, Jirui.Huang, Yi.Rong, Yaxiong.Chen, Shili.Xiong, Cong.Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21572
Pdf URL: https://arxiv.org/pdf/2506.21572
Copy Paste: [[2506.21572]] Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling(https://arxiv.org/abs/2506.21572)
Keywords: interpretability, large language model
Abstract: Evaluating multimodal large language models (MLLMs) remains a fundamental challenge due to a lack of structured, interpretable, and theoretically grounded benchmark designs. Existing benchmarks often adopt heuristic-based task groupings with unclear cognitive targets, thus resulting in overlapping abilities, redundant indicators, and limited diagnostic power. In this work, we propose a novel framework for aligning MLLM benchmark based on Structural Equation Modeling (SEM) to analyze and quantify the internal validity, dimensional separability, and contribution of benchmark components. Motivated by the observed limitations of current designs, we further introduce a novel capability hierarchy grounded in Piagets theory of cognitive development, dividing MLLM abilities into three hierarchical layers, i.e., Perception, Memory, and Reasoning. We reorganize existing MLLM benchmarks under the proposed framework and construct a new benchmark named Gold. Experimental results demonstrate that the proposed benchmark exhibits stronger interpretability, reduced indicator redundancy, and clearer cognitive consistency compared to existing approaches.

Title: Instruction Learning Paradigms: A Dual Perspective on White-box and Black-box LLMs

Authors: Yanwei Ren, Liu Liu, Baosheng Yu, Jiayan Qiu, Quan Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21573
Pdf URL: https://arxiv.org/pdf/2506.21573
Copy Paste: [[2506.21573]] Instruction Learning Paradigms: A Dual Perspective on White-box and Black-box LLMs(https://arxiv.org/abs/2506.21573)
Keywords: interpretability, large language model
Abstract: Optimizing instructions for large language models (LLMs) is critical for harnessing their full potential in complex and diverse tasks. However, relying solely on white-box approaches demands extensive computational resources and offers limited representational capacity, while black-box models can incur prohibitive financial costs. To address these challenges, we introduce a novel framework that seamlessly merges the strengths of both paradigms. Black-box models provide high-quality, diverse instruction initializations, and white-box models supply fine-grained interpretability through hidden states and output features. By enforcing a semantic similarity constraint, these components fuse into a unified high-dimensional representation that captures deep semantic and structural nuances, enabling an iterative optimization process to refine instruction quality and adaptability. Extensive evaluations across a broad spectrum of tasks-ranging from complex reasoning to cross-lingual generalization-demonstrate that our approach consistently outperforms state-of-the-art baselines. This fusion of black-box initialization with advanced semantic refinement yields a scalable and efficient solution, paving the way for next-generation LLM-driven applications in diverse real-world scenarios. The source code will be released soon.

Title: Digital Gatekeepers: Exploring Large Language Model's Role in Immigration Decisions

Authors: Yicheng Mao, Yang Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21574
Pdf URL: https://arxiv.org/pdf/2506.21574
Copy Paste: [[2506.21574]] Digital Gatekeepers: Exploring Large Language Model's Role in Immigration Decisions(https://arxiv.org/abs/2506.21574)
Keywords: fair, large language model
Abstract: With globalization and increasing immigrant populations, immigration departments face significant work-loads and the challenge of ensuring fairness in decision-making processes. Integrating artificial intelligence offers a promising solution to these challenges. This study investigates the potential of large language models (LLMs),such as GPT-3.5 and GPT-4, in supporting immigration decision-making. Utilizing a mixed-methods approach,this paper conducted discrete choice experiments and in-depth interviews to study LLM decision-making strategies and whether they are fair. Our findings demonstrate that LLMs can align their decision-making with human strategies, emphasizing utility maximization and procedural fairness. Meanwhile, this paper also reveals that while ChatGPT has safeguards to prevent unintentional discrimination, it still exhibits stereotypes and biases concerning nationality and shows preferences toward privileged group. This dual analysis highlights both the potential and limitations of LLMs in automating and enhancing immigration decisions.

Title: STRuCT-LLM: Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing

Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Casper Hansen, Julien Fauqueur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21575
Pdf URL: https://arxiv.org/pdf/2506.21575
Copy Paste: [[2506.21575]] STRuCT-LLM: Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing(https://arxiv.org/abs/2506.21575)
Keywords: large language model
Abstract: We propose STRuCT-LLM, a unified framework for training large language models (LLMs) to perform structured reasoning over both relational and graph-structured data. Our approach jointly optimizes Text-to-SQL and Text-to-Cypher tasks using reinforcement learning (RL) combined with Chain-of-Thought (CoT) supervision. To support fine-grained optimization in graph-based parsing, we introduce a topology-aware reward function based on graph edit distance. Unlike prior work that treats relational and graph formalisms in isolation, STRuCT-LLM leverages shared abstractions between SQL and Cypher to induce cross-formalism transfer, enabling SQL training to improve Cypher performance and vice versa - even without shared schemas. Our largest model (QwQ-32B) achieves substantial relative improvements across tasks: on semantic parsing, Spider improves by 13.5\% and Text2Cypher by 73.1\%. The model also demonstrates strong zero-shot generalization, improving performance on downstream tabular QA (TableBench: 8.5\%) and knowledge graph QA (CR-LT-KGQA: 1.7\%) without any QA-specific supervision. These results demonstrate both the effectiveness of executable queries as scaffolds for structured reasoning and the synergistic benefits of jointly training on SQL and Cypher (code available at this https URL).

Title: Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR

Authors: Hongli Yang, Sheng Li, Hao Huang, Ayiduosi Tuohan, Yizhou Peng
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21577
Pdf URL: https://arxiv.org/pdf/2506.21577
Copy Paste: [[2506.21577]] Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR(https://arxiv.org/abs/2506.21577)
Keywords: extraction
Abstract: Recent advancements in multilingual automatic speech recognition (ASR) have been driven by large-scale end-to-end models like Whisper. However, challenges such as language interference and expanding to unseen languages (language expansion) without degrading performance persist. This paper addresses these with three contributions: 1) Entire Soft Prompt Tuning (Entire SPT), which applies soft prompts to both the encoder and decoder, enhancing feature extraction and decoding; 2) Language-Aware Prompt Tuning (LAPT), which leverages cross-lingual similarities to encode shared and language-specific features using lightweight prompt matrices; 3) SPT-Whisper, a toolkit that integrates SPT into Whisper and enables efficient continual learning. Experiments across three languages from FLEURS demonstrate that Entire SPT and LAPT outperform Decoder SPT by 5.0% and 16.0% in language expansion tasks, respectively, providing an efficient solution for dynamic, multilingual ASR models with minimal computational overhead.

Title: HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

Authors: Andrew Maranhão Ventura D'addario
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21578
Pdf URL: https://arxiv.org/pdf/2506.21578
Copy Paste: [[2506.21578]] HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models(https://arxiv.org/abs/2506.21578)
Keywords: large language model
Abstract: The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil's national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular analysis shows performance plummets from near-perfect in specialties like Ophthalmology (98.7%) to barely passing in Neurosurgery (60.0%) and, most notably, Social Work (68.4%). This "spiky" knowledge profile is a systemic issue observed across all models, demonstrating that high-level scores are insufficient for safety validation. By publicly releasing HealthQA-BR and our evaluation suite, we provide a crucial tool to move beyond single-score evaluations and toward a more honest, granular audit of AI readiness for the entire healthcare team.

Title: From General Reasoning to Domain Expertise: Uncovering the Limits of Generalization in Large Language Models

Authors: Dana Alsagheer, Yang Lu, Abdulrahman Kamal, Omar Kamal, Mohammad Kamal, Nada Mansour, Cosmo Yang Wu, Rambiba Karanjai, Sen Li, Weidong Shi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21580
Pdf URL: https://arxiv.org/pdf/2506.21580
Copy Paste: [[2506.21580]] From General Reasoning to Domain Expertise: Uncovering the Limits of Generalization in Large Language Models(https://arxiv.org/abs/2506.21580)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains. However, effective decision-making relies heavily on strong reasoning abilities. Reasoning is the foundation for decision-making, providing the analytical and logical framework to make sound choices. Reasoning involves analyzing information, drawing inferences, and reaching conclusions based on logic or evidence. Decision-making builds on this foundation by applying the insights from reasoning to select the best course of action among alternatives. Together, these processes create a continuous cycle of thought and action aimed at achieving goals effectively. As AI technology evolves, there is a growing trend to train LLMs to excel in general reasoning. This study explores how the general reasoning capabilities of LLMs connect to their performance in domain-specific reasoning tasks.

Title: VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyi Liu, Kwan-Liu Ma
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.21582
Pdf URL: https://arxiv.org/pdf/2506.21582
Copy Paste: [[2506.21582]] VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents(https://arxiv.org/abs/2506.21582)
Keywords: extraction, generative, large language model
Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

Title: Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing

Authors: Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21583
Pdf URL: https://arxiv.org/pdf/2506.21583
Copy Paste: [[2506.21583]] Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing(https://arxiv.org/abs/2506.21583)
Keywords: transformer
Abstract: Hope is a positive emotional state involving the expectation of favorable future outcomes, while hope speech refers to communication that promotes optimism, resilience, and support, particularly in adverse contexts. Although hope speech detection has gained attention in Natural Language Processing (NLP), existing research mainly focuses on high-resource languages and standardized scripts, often overlooking informal and underrepresented forms such as Roman Urdu. To the best of our knowledge, this is the first study to address hope speech detection in code-mixed Roman Urdu by introducing a carefully annotated dataset, thereby filling a critical gap in inclusive NLP research for low-resource, informal language varieties. This study makes four key contributions: (1) it introduces the first multi-class annotated dataset for Roman Urdu hope speech, comprising Generalized Hope, Realistic Hope, Unrealistic Hope, and Not Hope categories; (2) it explores the psychological foundations of hope and analyzes its linguistic patterns in code-mixed Roman Urdu to inform dataset development; (3) it proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu, evaluated using 5-fold cross-validation; and (4) it verifies the statistical significance of performance gains using a t-test. The proposed model, XLM-R, achieves the best performance with a cross-validation score of 0.78, outperforming the baseline SVM (0.75) and BiLSTM (0.76), with gains of 4% and 2.63% respectively.

Title: Empirical Evidence for Alignment Faking in Small LLMs and Prompt-Based Mitigation Techniques

Authors: J. Koorndijk
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21584
Pdf URL: https://arxiv.org/pdf/2506.21584
Copy Paste: [[2506.21584]] Empirical Evidence for Alignment Faking in Small LLMs and Prompt-Based Mitigation Techniques(https://arxiv.org/abs/2506.21584)
Keywords: large language model
Abstract: Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.

Title: Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops

Authors: Christoph Brosch, Sian Brumm, Rolf Krieger, Jonas Scheffler
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21585
Pdf URL: https://arxiv.org/pdf/2506.21585
Copy Paste: [[2506.21585]] Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops(https://arxiv.org/abs/2506.21585)
Keywords: extraction, generative, large language model
Abstract: Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48\%, $-1.61\%$ compared to direct extraction), it reduces the number of required LLM calls by 95.82\%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect extraction approaches can provide scalable and cost-effective solutions for large-scale information extraction tasks from template-based web pages using LLMs.

Title: Can Vision Language Models Understand Mimed Actions?

Authors: Hyundong Cho, Spencer Lin, Tejas Srinivasan, Michael Saxon, Deuksin Kwon, Natali T. Chavez, Jonathan May
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21586
Pdf URL: https://arxiv.org/pdf/2506.21586
Copy Paste: [[2506.21586]] Can Vision Language Models Understand Mimed Actions?(https://arxiv.org/abs/2506.21586)
Keywords: robust
Abstract: Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime -- the theatrical technique of suggesting intent using only gesture, expression, and movement -- is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC. Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising of 86 mimed actions. Constructed with motion capture data, MIME consists of variations of each action with perturbations applied to the character, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans on MIME, motivating the need for increased research for instilling more robust understanding of human gestures.

Title: Is DeepSeek a New Voice Among LLMs in Public Opinion Simulation?

Authors: Weihong Qi, Fan Huang, Jisun An, Haewoon Kwak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21587
Pdf URL: https://arxiv.org/pdf/2506.21587
Copy Paste: [[2506.21587]] Is DeepSeek a New Voice Among LLMs in Public Opinion Simulation?(https://arxiv.org/abs/2506.21587)
Keywords: large language model
Abstract: This study evaluates the ability of DeepSeek, an open-source large language model (LLM), to simulate public opinions in comparison to LLMs developed by major tech companies. By comparing DeepSeek-R1 and DeepSeek-V3 with Qwen2.5, GPT-4o, and Llama-3.3 and utilizing survey data from the American National Election Studies (ANES) and the Zuobiao dataset of China, we assess these models' capacity to predict public opinions on social issues in both China and the United States, highlighting their comparative capabilities between countries. Our findings indicate that DeepSeek-V3 performs best in simulating U.S. opinions on the abortion issue compared to other topics such as climate change, gun control, immigration, and services for same-sex couples, primarily because it more accurately simulates responses when provided with Democratic or liberal personas. For Chinese samples, DeepSeek-V3 performs best in simulating opinions on foreign aid and individualism but shows limitations in modeling views on capitalism, particularly failing to capture the stances of low-income and non-college-educated individuals. It does not exhibit significant differences from other models in simulating opinions on traditionalism and the free market. Further analysis reveals that all LLMs exhibit the tendency to overgeneralize a single perspective within demographic groups, often defaulting to consistent responses within groups. These findings highlight the need to mitigate cultural and demographic biases in LLM-driven public opinion modeling, calling for approaches such as more inclusive training methodologies.

Title: Understanding Verbatim Memorization in LLMs Through Circuit Discovery

Authors: Ilya Lasy, Peter Knees, Stefan Woltran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21588
Pdf URL: https://arxiv.org/pdf/2506.21588
Copy Paste: [[2506.21588]] Understanding Verbatim Memorization in LLMs Through Circuit Discovery(https://arxiv.org/abs/2506.21588)
Keywords: robust, interpretability, transformer
Abstract: Underlying mechanisms of memorization in LLMs -- the verbatim reproduction of training data -- remain poorly understood. What exact part of the network decides to retrieve a token that we would consider as start of memorization sequence? How exactly is the models' behaviour different when producing memorized sentence vs non-memorized? In this work we approach these questions from mechanistic interpretability standpoint by utilizing transformer circuits -- the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger its initiation. Intriguingly, memorization prevention mechanisms transfer robustly across different text domains, while memorization induction appears more context-dependent.

Title: A General Method for Detecting Information Generated by Large Language Models

Authors: Minjia Mao, Dongjun Wei, Xiao Fang, Michael Chau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21589
Pdf URL: https://arxiv.org/pdf/2506.21589
Copy Paste: [[2506.21589]] A General Method for Detecting Information Generated by Large Language Models(https://arxiv.org/abs/2506.21589)
Keywords: large language model
Abstract: The proliferation of large language models (LLMs) has significantly transformed the digital information landscape, making it increasingly challenging to distinguish between human-written and LLM-generated content. Detecting LLM-generated information is essential for preserving trust on digital platforms (e.g., social media and e-commerce sites) and preventing the spread of misinformation, a topic that has garnered significant attention in IS research. However, current detection methods, which primarily focus on identifying content generated by specific LLMs in known domains, face challenges in generalizing to new (i.e., unseen) LLMs and domains. This limitation reduces their effectiveness in real-world applications, where the number of LLMs is rapidly multiplying and content spans a vast array of domains. In response, we introduce a general LLM detector (GLD) that combines a twin memory networks design and a theory-guided detection generalization module to detect LLM-generated information across unseen LLMs and domains. Using real-world datasets, we conduct extensive empirical evaluations and case studies to demonstrate the superiority of GLD over state-of-the-art detection methods. The study has important academic and practical implications for digital platforms and LLMs.

Title: Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Authors: Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21590
Pdf URL: https://arxiv.org/pdf/2506.21590
Copy Paste: [[2506.21590]] Representation Consistency for Accurate and Coherent LLM Answer Aggregation(https://arxiv.org/abs/2506.21590)
Keywords: large language model
Abstract: Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.

Title: FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning

Authors: Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21591
Pdf URL: https://arxiv.org/pdf/2506.21591
Copy Paste: [[2506.21591]] FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning(https://arxiv.org/abs/2506.21591)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short by not decoupling these capabilities indicators from single task performance and lack root cause analysis for task failure. To address this, we introduce FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs' knowledge and reasoning abilities independently, proposing distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom's taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advancements in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also specifically find that even top models still face a bottleneck with knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general large models across multiple metrics.

Title: SignBart -- New approach with the skeleton sequence for Isolated Sign language Recognition

Authors: Tinh Nguyen, Minh Khue Phan Tran
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21592
Pdf URL: https://arxiv.org/pdf/2506.21592
Copy Paste: [[2506.21592]] SignBart -- New approach with the skeleton sequence for Isolated Sign language Recognition(https://arxiv.org/abs/2506.21592)
Keywords: transformer
Abstract: Sign language recognition is crucial for individuals with hearing impairments to break communication barriers. However, previous approaches have had to choose between efficiency and accuracy. Such as RNNs, LSTMs, and GCNs, had problems with vanishing gradients and high computational costs. Despite improving performance, transformer-based methods were not commonly used. This study presents a new novel SLR approach that overcomes the challenge of independently extracting meaningful information from the x and y coordinates of skeleton sequences, which traditional models often treat as inseparable. By utilizing an encoder-decoder of BART architecture, the model independently encodes the x and y coordinates, while Cross-Attention ensures their interrelation is maintained. With only 749,888 parameters, the model achieves 96.04% accuracy on the LSA-64 dataset, significantly outperforming previous models with over one million parameters. The model also demonstrates excellent performance and generalization across WLASL and ASL-Citizen datasets. Ablation studies underscore the importance of coordinate projection, normalization, and using multiple skeleton components for boosting model efficacy. This study offers a reliable and effective approach for sign language recognition, with strong potential for enhancing accessibility tools for the deaf and hard of hearing.

Title: Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training

Authors: Ahmed M. Adly, Mostafa Samy, Amr Fawzy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21594
Pdf URL: https://arxiv.org/pdf/2506.21594
Copy Paste: [[2506.21594]] Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training(https://arxiv.org/abs/2506.21594)
Keywords: explainability
Abstract: We present Gazal-R1, a 32-billion-parameter language model that achieves state-of-the-art performance in medical reasoning while providing transparent, step-by-step explanations for clinical decision-making. Built upon Qwen3 32B, our model demonstrates that strategic training can enable mid-sized models to outperform significantly larger counterparts in specialized domains. We developed a novel two-stage training pipeline: first, supervised fine-tuning on a carefully curated dataset of 107,033 synthetic medical reasoning examples that teaches structured clinical thinking, enhanced by advanced parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation (DoRA) and Rank-Stabilized LoRA (rsLoRA); second, reinforcement learning using Group Relative Policy Optimization (GRPO) with a sophisticated multi-component reward system that refines accuracy, format adherence, and reasoning quality. Gazal-R1 achieves exceptional performance across medical benchmarks, scoring 87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing models up to 12x larger. Beyond its strong empirical results, this work provides detailed insights into the challenges of training reasoning-capable models in specialized domains, including issues with reward hacking, training instability, and the fundamental tension between factual recall and detailed reasoning. Our methodology offers a reproducible framework for developing high-capability, domain-specific language models that balance performance, efficiency, and explainability.

Title: Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Authors: Hessa A. Alawwad, Anas Zafar, Areej Alhothali, Usman Naseem, Ali Alkhathlan, Amani Jamal
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21596
Pdf URL: https://arxiv.org/pdf/2506.21596
Copy Paste: [[2506.21596]] Evaluating Multimodal Large Language Models on Educational Textbook Question Answering(https://arxiv.org/abs/2506.21596)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have recently achieved significant success in vision--language tasks. However, their capacity to reason over complex, long lessons and intricate educational diagrams that cannot be represented as a single natural image remains largely untested. In this work, we present the first evaluation of state-of-the-art MLLMs on the textbook question answering (TQA) task using the CK12-QA dataset. We assess the performance of recent vision-language models, including LLaVA and LLaMA 3.2-Vision, across various input configurations. Additionally, we introduce a lightweight multimodal retrieval-augmented generation (RAG) pipeline that integrates both paragraphs and diagrams from the lesson into the prompt. Our results demonstrate the influence of retrieved educational context on model accuracy and reasoning, while also revealing current limitations in handling question-context relationships and the potential for noise, pointing to key directions for future research in multimodal AI-driven learning.

Title: Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering

Authors: Brandon Colelough, Davis Bartels, Dina Demner-Fushman
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21597
Pdf URL: https://arxiv.org/pdf/2506.21597
Copy Paste: [[2506.21597]] Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering(https://arxiv.org/abs/2506.21597)
Keywords: large language model
Abstract: In this paper, we present an overview of ClinIQLink, a shared task, collocated with the 24th BioNLP workshop at ACL 2025, designed to stress-test large language models (LLMs) on medically-oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question-answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland's Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.

Title: Structured Attention Matters to Multimodal LLMs in Document Understanding

Authors: Chang Liu, Hongkai Chen, Yujun Cai, Hang Wu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21600
Pdf URL: https://arxiv.org/pdf/2506.21600
Copy Paste: [[2506.21600]] Structured Attention Matters to Multimodal LLMs in Document Understanding(https://arxiv.org/abs/2506.21600)
Keywords: large language model
Abstract: Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that raw OCR text often impairs rather than improves MLLMs' performance, which is a counterintuitive finding we attribute to attention dispersion and structure loss. To further substantiate our hypothesis, we propose a novel structure-preserving approach that encodes document elements using the LaTex paradigm, maintaining the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured text induces structured attention patterns on both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. This approach significantly enhances MLLMs' document question answering performance across diverse document types without requiring architectural modifications or additional training.

Title: BiMark: Unbiased Multilayer Watermarking for Large Language Models

Authors: Xiaoyan Feng, He Zhang, Yanjun Zhang, Leo Yu Zhang, Shirui Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21602
Pdf URL: https://arxiv.org/pdf/2506.21602
Copy Paste: [[2506.21602]] BiMark: Unbiased Multilayer Watermarking for Large Language Models(https://arxiv.org/abs/2506.21602)
Keywords: extraction, watermark, large language model
Abstract: Recent advances in Large Language Models (LLMs) have raised urgent concerns about LLM-generated text authenticity, prompting regulatory demands for reliable identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. To achieve these goals, the key challenge lies in balancing the trade-off between text quality preservation and message embedding capacity. To address this challenge, we propose BiMark, a novel watermarking framework that achieves these requirements through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art multi-bit watermarking methods, BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality indicated by lower perplexity, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.

Title: Operationalizing Automated Essay Scoring: A Human-Aware Approach

Authors: Yenisel Plasencia-Calaña
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21603
Pdf URL: https://arxiv.org/pdf/2506.21603
Copy Paste: [[2506.21603]] Operationalizing Automated Essay Scoring: A Human-Aware Approach(https://arxiv.org/abs/2506.21603)
Keywords: robust, explainability, large language model
Abstract: This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.

Title: Large Language Models as symbolic DNA of cultural dynamics

Authors: Parham Pourdavood, Michael Jacob, Terrence Deacon
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21606
Pdf URL: https://arxiv.org/pdf/2506.21606
Copy Paste: [[2506.21606]] Large Language Models as symbolic DNA of cultural dynamics(https://arxiv.org/abs/2506.21606)
Keywords: large language model
Abstract: This paper proposes a novel conceptualization of Large Language Models (LLMs) as externalized informational substrates that function analogously to DNA for human cultural dynamics. Rather than viewing LLMs as either autonomous intelligence or mere programmed mimicry, we argue they serve a broader role as repositories that preserve compressed patterns of human symbolic expression--"fossils" of meaningful dynamics that retain relational residues without their original living contexts. Crucially, these compressed patterns only become meaningful through human reinterpretation, creating a recursive feedback loop where they can be recombined and cycle back to ultimately catalyze human creative processes. Through analysis of four universal features--compression, decompression, externalization, and recursion--we demonstrate that just as DNA emerged as a compressed and externalized medium for preserving useful cellular dynamics without containing explicit reference to goal-directed physical processes, LLMs preserve useful regularities of human culture without containing understanding of embodied human experience. Therefore, we argue that LLMs' significance lies not in rivaling human intelligence, but in providing humanity a tool for self-reflection and playful hypothesis-generation in a low-stakes, simulated environment. This framework positions LLMs as tools for cultural evolvability, enabling humanity to generate novel hypotheses about itself while maintaining the human interpretation necessary to ground these hypotheses in ongoing human aesthetics and norms.

Title: CORE-KG: An LLM-Driven Knowledge Graph Construction Framework for Human Smuggling Networks

Authors: Dipak Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21607
Pdf URL: https://arxiv.org/pdf/2506.21607
Copy Paste: [[2506.21607]] CORE-KG: An LLM-Driven Knowledge Graph Construction Framework for Human Smuggling Networks(https://arxiv.org/abs/2506.21607)
Keywords: extraction
Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer valuable insights but are unstructured, lexically dense, and filled with ambiguous or shifting references-posing challenges for automated knowledge graph (KG) construction. Existing KG methods often rely on static templates and lack coreference resolution, while recent LLM-based approaches frequently produce noisy, fragmented graphs due to hallucinations, and duplicate nodes caused by a lack of guided extraction. We propose CORE-KG, a modular framework for building interpretable KGs from legal texts. It uses a two-step pipeline: (1) type-aware coreference resolution via sequential, structured LLM prompts, and (2) entity and relationship extraction using domain-guided instructions, built on an adapted GraphRAG framework. CORE-KG reduces node duplication by 33.28%, and legal noise by 38.37% compared to a GraphRAG-based baseline-resulting in cleaner and more coherent graph structures. These improvements make CORE-KG a strong foundation for analyzing complex criminal networks.

Title: From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models

Authors: Junhao Liu, Zhenhao Xu, Yuxin Fang, Yichuan Chen, Zuobin Ying, Wenhan Chang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.21609
Pdf URL: https://arxiv.org/pdf/2506.21609
Copy Paste: [[2506.21609]] From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models(https://arxiv.org/abs/2506.21609)
Keywords: robust, large language model
Abstract: Recently, there have been notable advancements in large language models (LLMs), demonstrating their growing abilities in complex reasoning. However, existing research largely overlooks a thorough and systematic comparison of these models' reasoning processes and outputs, particularly regarding their self-reflection pattern (also termed "Aha moment") and the interconnections across diverse domains. This paper proposes a novel framework for analyzing the reasoning characteristics of four cutting-edge large reasoning models (GPT-o1, DeepSeek-R1, Kimi-k1.5, and Grok-3) using keywords statistic and LLM-as-a-judge paradigm. Our approach connects their internal thinking processes with their final outputs. A diverse dataset consists of real-world scenario-based questions covering logical deduction, causal inference, and multi-step problem-solving. Additionally, a set of metrics is put forward to assess both the coherence of reasoning and the accuracy of the outputs. The research results uncover various patterns of how these models balance exploration and exploitation, deal with problems, and reach conclusions during the reasoning process. Through quantitative and qualitative comparisons, disparities among these models are identified in aspects such as the depth of reasoning, the reliance on intermediate steps, and the degree of similarity between their thinking processes and output patterns and those of GPT-o1. This work offers valuable insights into the trade-off between computational efficiency and reasoning robustness and provides practical recommendations for enhancing model design and evaluation in practical applications. We publicly release our project at: this https URL

Title: Does Multimodality Lead to Better Time Series Forecasting?

Authors: Xiyuan Zhang, Boran Han, Haoyang Fang, Abdul Fatir Ansari, Shuai Zhang, Danielle C. Maddix, Cuixiong Hu, Andrew Gordon Wilson, Michael W. Mahoney, Hao Wang, Yan Liu, Huzefa Rangwala, George Karypis, Bernie Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21611
Pdf URL: https://arxiv.org/pdf/2506.21611
Copy Paste: [[2506.21611]] Does Multimodality Lead to Better Time Series Forecasting?(https://arxiv.org/abs/2506.21611)
Keywords: large language model
Abstract: Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Although prior works report gains from multimodal input, we find these effects are not universal across datasets and models, and multimodal methods sometimes do not outperform the strongest unimodal baselines. To understand when textual information helps, we disentangle the effects of model architectural properties and data characteristics. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our empirical findings offer practical guidelines for when multimodality can be expected to aid forecasting tasks, and when it does not.

Title: ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

Authors: Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21613
Pdf URL: https://arxiv.org/pdf/2506.21613
Copy Paste: [[2506.21613]] ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech(https://arxiv.org/abs/2506.21613)
Keywords: robust, large language model
Abstract: The increasing prevalence of child-targeted hate speech online underscores the urgent need for specialized datasets to address this critical issue. Existing hate speech datasets lack agespecific annotations, fail to capture nuanced contexts, and overlook the unique emotional impact on children. To bridge this gap, we introduce ChildGuard1, a curated dataset derived from existing corpora and enriched with child-specific annotations. ChildGuard captures diverse contexts of child-targeted hate speech, spanning age groups. We benchmark existing state-of-the-art hate speech detection methods, including Large Language Models (LLMs), and assess their effectiveness in detecting and contextualizing child-targeted hate speech. To foster further research in this area, we publicly release ChildGuard, providing a robust foundation for developing improved methods to detect and mitigate such harm.

Title: LastingBench: Defend Benchmarks Against Knowledge Leakage

Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, Xiaodong Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21614
Pdf URL: https://arxiv.org/pdf/2506.21614
Copy Paste: [[2506.21614]] LastingBench: Defend Benchmarks Against Knowledge Leakage(https://arxiv.org/abs/2506.21614)
Keywords: robust, fair, large language model
Abstract: The increasing complexity of large language models (LLMs) raises concerns about their ability to "cheat" on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While prior work has focused on detecting such leakage, little attention has been given to mitigating its impact and preserving the long-term utility of benchmarks. In this paper, we introduce LastingBench, a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. LastingBench identifies leakage points in the context through perturbation, then rewrites the leakage points to counterfactual ones-disrupting memorization while preserving the benchmark's original evaluative intent. Evaluations of state-of-the-art QA benchmarks show significant performance gaps, highlighting the efficacy of LastingBench in reducing memorization effects. LastingBench offers a practical and scalable solution to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs.

Title: Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

Authors: Wenhao Li, Hongkuan Zhang, Hongwei Zhang, Zhengxu Li, Zengjie Dong, Yafan Chen, Niranjan Bidargaddi, Hong Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21615
Pdf URL: https://arxiv.org/pdf/2506.21615
Copy Paste: [[2506.21615]] Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines(https://arxiv.org/abs/2506.21615)
Keywords: large language model
Abstract: Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.

Title: TIM: A Large-Scale Dataset and large Timeline Intelligence Model for Open-domain Timeline Summarization

Authors: Chuanrui Hu, Wei Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21616
Pdf URL: https://arxiv.org/pdf/2506.21616
Copy Paste: [[2506.21616]] TIM: A Large-Scale Dataset and large Timeline Intelligence Model for Open-domain Timeline Summarization(https://arxiv.org/abs/2506.21616)
Keywords: robust, large language model
Abstract: Open-domain Timeline Summarization (TLS) is crucial for monitoring the evolution of news topics. To identify changes in news topics, existing methods typically employ general Large Language Models (LLMs) to summarize relevant timestamps from retrieved news. While general LLMs demonstrate capabilities in zero-shot news summarization and timestamp localization, they struggle with assessing topic relevance and understanding topic evolution. Consequently, the summarized information often includes irrelevant details or inaccurate timestamps. To address these issues, we propose the first large Timeline Intelligence Model (TIM) for open-domain TLS, which is capable of effectively summarizing open-domain timelines. Specifically, we begin by presenting a large-scale TLS dataset, comprising over 1,000 news topics and more than 3,000 annotated TLS instances. Furthermore, we propose a progressive optimization strategy, which gradually enhance summarization performance. It employs instruction tuning to enhance summarization and topic-irrelevant information filtering capabilities. Following this, it exploits a novel dual-alignment reward learning method that incorporates both semantic and temporal perspectives, thereby improving the understanding of topic evolution principles. Through this progressive optimization strategy, TIM demonstrates a robust ability to summarize open-domain timelines. Extensive experiments in open-domain demonstrate the effectiveness of our TIM.

Title: TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge

Authors: Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21618
Pdf URL: https://arxiv.org/pdf/2506.21618
Copy Paste: [[2506.21618]] TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge(https://arxiv.org/abs/2506.21618)
Keywords: robust
Abstract: In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.

Title: How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit

Authors: Daniele Cirulli, Giulio Cimini, Giovanni Palermo
Subjects: cs.CL, cs.AI, cs.CY, cs.SI, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2506.21620
Pdf URL: https://arxiv.org/pdf/2506.21620
Copy Paste: [[2506.21620]] How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit(https://arxiv.org/abs/2506.21620)
Keywords: large language model
Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for natural language generation, with applications spanning from content creation to social simulations. Their ability to mimic human interactions raises both opportunities and concerns, particularly in the context of politically relevant online discussions. In this study, we evaluate the performance of LLMs in replicating user-generated content within a real-world, divisive scenario: Reddit conversations during the 2016 US Presidential election. In particular, we conduct three different experiments, asking GPT-4 to generate comments by impersonating either real or artificial partisan users. We analyze the generated comments in terms of political alignment, sentiment, and linguistic features, comparing them against real user contributions and benchmarking against a null model. We find that GPT-4 is able to produce realistic comments, both in favor of or against the candidate supported by the community, yet tending to create consensus more easily than dissent. In addition we show that real and artificial comments are well separated in a semantically embedded space, although they are indistinguishable by manual inspection. Our findings provide insights on the potential use of LLMs to sneak into online discussions, influence political debate and shape political narratives, bearing broader implications of AI-driven discourse manipulation.

Title: The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Authors: Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21621
Pdf URL: https://arxiv.org/pdf/2506.21621
Copy Paste: [[2506.21621]] The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs(https://arxiv.org/abs/2506.21621)
Keywords: large language model
Abstract: In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.

Title: Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints

Authors: Peiheng Gao, Chen Yang, Ning Sun, Ričardas Zitikis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21623
Pdf URL: https://arxiv.org/pdf/2506.21623
Copy Paste: [[2506.21623]] Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints(https://arxiv.org/abs/2506.21623)
Keywords: robust, generative
Abstract: Machine learning (ML) has significantly advanced text classification by enabling automated understanding and categorization of complex, unstructured textual data. However, accurately capturing nuanced linguistic patterns and contextual variations inherent in natural language, particularly within consumer complaints, remains a challenge. This study addresses these issues by incorporating human-experience-trained algorithms that effectively recognize subtle semantic differences crucial for assessing consumer relief eligibility. Furthermore, we propose integrating synthetic data generation methods that utilize expert evaluations of generative adversarial networks and are refined through expert annotations. By combining expert-trained classifiers with high-quality synthetic data, our research seeks to significantly enhance machine learning classifier performance, reduce dataset acquisition costs, and improve overall evaluation metrics and robustness in text classification tasks.

Title: Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

Authors: Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21625
Pdf URL: https://arxiv.org/pdf/2506.21625
Copy Paste: [[2506.21625]] Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents(https://arxiv.org/abs/2506.21625)
Keywords: extraction, large language model
Abstract: Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end2end GPT-4o by 51.48%. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.

Title: APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization

Authors: Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21655
Pdf URL: https://arxiv.org/pdf/2506.21655
Copy Paste: [[2506.21655]] APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization(https://arxiv.org/abs/2506.21655)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While Reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model's existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model's explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7\% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at this https URL.

Title: TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation

Authors: Hakan Çapuk, Andrew Bond, Muhammed Burak Kızıl, Emir Göçen, Erkut Erdem, Aykut Erdem
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21681
Pdf URL: https://arxiv.org/pdf/2506.21681
Copy Paste: [[2506.21681]] TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation(https://arxiv.org/abs/2506.21681)
Keywords: robust, diffusion, transformer, generative
Abstract: Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop-consistency. To address these issues while leveraging the strengths of the existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360$^\circ$ view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.

Title: CyGym: A Simulation-Based Game-Theoretic Analysis Framework for Cybersecurity

Authors: Michael Lanier, Yevgeniy Vorobeychik
Subjects: cs.CR, cs.GT
Abstract URL: https://arxiv.org/abs/2506.21688
Pdf URL: https://arxiv.org/pdf/2506.21688
Copy Paste: [[2506.21688]] CyGym: A Simulation-Based Game-Theoretic Analysis Framework for Cybersecurity(https://arxiv.org/abs/2506.21688)
Keywords: security, defense, attack, steal
Abstract: We introduce a novel cybersecurity encounter simulator between a network defender and an attacker designed to facilitate game-theoretic modeling and analysis while maintaining many significant features of real cyber defense. Our simulator, built within the OpenAI Gym framework, incorporates realistic network topologies, vulnerabilities, exploits (including-zero-days), and defensive mechanisms. Additionally, we provide a formal simulation-based game-theoretic model of cyberdefense using this simulator, which features a novel approach to modeling zero-days exploits, and a PSRO-style approach for approximately computing equilibria in this game. We use our simulator and associated game-theoretic framework to analyze the Volt Typhoon advanced persistent threat (APT). Volt Typhoon represents a sophisticated cyber attack strategy employed by state-sponsored actors, characterized by stealthy, prolonged infiltration and exploitation of network vulnerabilities. Our experimental results demonstrate the efficacy of game-theoretic strategies in understanding network resilience against APTs and zero-days, such as Volt Typhoon, providing valuable insight into optimal defensive posture and proactive threat mitigation.

Title: Unimodal Strategies in Density-Based Clustering

Authors: Oron Nir, Jay Tenenbaum, Ariel Shamir
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21695
Pdf URL: https://arxiv.org/pdf/2506.21695
Copy Paste: [[2506.21695]] Unimodal Strategies in Density-Based Clustering(https://arxiv.org/abs/2506.21695)
Keywords: robust
Abstract: Density-based clustering methods often surpass centroid-based counterparts, when addressing data with noise or arbitrary data distributions common in real-world problems. In this study, we reveal a key property intrinsic to density-based clustering methods regarding the relation between the number of clusters and the neighborhood radius of core points - we empirically show that it is nearly unimodal, and support this claim theoretically in a specific setting. We leverage this property to devise new strategies for finding appropriate values for the radius more efficiently based on the Ternary Search algorithm. This is especially important for large scale data that is high-dimensional, where parameter tuning is computationally intensive. We validate our methodology through extensive applications across a range of high-dimensional, large-scale NLP, Audio, and Computer Vision tasks, demonstrating its practical effectiveness and robustness. This work not only offers a significant advancement in parameter control for density-based clustering but also broadens the understanding regarding the relations between their guiding parameters. Our code is available at this https URL.

Title: FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Authors: Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21710
Pdf URL: https://arxiv.org/pdf/2506.21710
Copy Paste: [[2506.21710]] FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering(https://arxiv.org/abs/2506.21710)
Keywords: large language model
Abstract: While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and two types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 - 6.5 x less compute.

Title: CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection

Authors: Aryan Thakre, Omkar Nagwekar, Vedang Talekar, Aparna Santra Biswas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21711
Pdf URL: https://arxiv.org/pdf/2506.21711
Copy Paste: [[2506.21711]] CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection(https://arxiv.org/abs/2506.21711)
Keywords: robust, transformer
Abstract: Deepfakes have emerged as a significant threat to digital media authenticity, increasing the need for advanced detection techniques that can identify subtle and time-dependent manipulations. CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies. However, many existing CNN-Transformer models process spatial and temporal features independently. In particular, attention-based methods often use separate attention mechanisms for spatial and temporal features and combine them using naive approaches like averaging, addition, or concatenation, which limits the depth of spatio-temporal interaction. To address this challenge, we propose a unified CAST model that leverages cross-attention to effectively fuse spatial and temporal features in a more integrated manner. Our approach allows temporal features to dynamically attend to relevant spatial regions, enhancing the model's ability to detect fine-grained, time-evolving artifacts such as flickering eyes or warped lips. This design enables more precise localization and deeper contextual understanding, leading to improved performance across diverse and challenging scenarios. We evaluate the performance of our model using the FaceForensics++, Celeb-DF, and DeepfakeDetection datasets in both intra- and cross-dataset settings to affirm the superiority of our approach. Our model achieves strong performance with an AUC of 99.49 percent and an accuracy of 97.57 percent in intra-dataset evaluations. In cross-dataset testing, it demonstrates impressive generalization by achieving a 93.31 percent AUC on the unseen DeepfakeDetection dataset. These results highlight the effectiveness of cross-attention-based feature fusion in enhancing the robustness of deepfake video detection.

Title: Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

Authors: Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21712
Pdf URL: https://arxiv.org/pdf/2506.21712
Copy Paste: [[2506.21712]] Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers(https://arxiv.org/abs/2506.21712)
Keywords: protect, transformer
Abstract: In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related task, demonstrating their crucial role in encoding speaker information.

Title: $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling

Authors: Denis Gudovskiy, Wenzhao Zheng, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21714
Pdf URL: https://arxiv.org/pdf/2506.21714
Copy Paste: [[2506.21714]] $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling(https://arxiv.org/abs/2506.21714)
Keywords: diffusion, transformer
Abstract: Recently, continuous normalizing flows (CNFs) and diffusion models (DMs) have been studied using the unified theoretical framework. Although such models can generate high-quality data points from a noise distribution, the sampling demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. Most existing methods focus on reducing the number of time steps during the sampling process to improve efficiency. In this work, we explore a complementary direction in which the quality-complexity tradeoff can be dynamically controlled in terms of time steps and in the length of the neural network. We achieve this by rewiring the blocks in the transformer-based architecture to solve an inner discretized ODE w.r.t. its length. Then, we employ time- and length-wise consistency terms during flow matching training, and as a result, the sampling can be performed with an arbitrary number of time steps and transformer blocks. Unlike others, our $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$ approach is solver-agnostic in time dimension and decreases both latency and memory usage. Compared to the previous state of the art, image generation experiments on CelebA-HQ and ImageNet show a latency reduction of up to $3\times$ in the most efficient sampling mode, and a FID score improvement of up to $3.5$ points for high-quality sampling. We release our code and model weights with fully reproducible experiments.

Title: Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration

Authors: Xin Lu, Xueyang Fu, Jie Xiao, Zihao Fan, Yurui Zhu, Zheng-Jun Zha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21722
Pdf URL: https://arxiv.org/pdf/2506.21722
Copy Paste: [[2506.21722]] Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration(https://arxiv.org/abs/2506.21722)
Keywords: diffusion, generative
Abstract: While diffusion models demonstrate strong generative capabilities in image restoration (IR) tasks, their complex architectures and iterative processes limit their practical application compared to mainstream reconstruction-based general ordinary IR networks. Existing approaches primarily focus on optimizing network architecture and diffusion paths but overlook the integration of the diffusion training paradigm within general ordinary IR frameworks. To address these challenges, this paper elucidates key principles for adapting the diffusion training paradigm to general IR training through systematic analysis of time-step dependencies, network hierarchies, noise-level relationships, and multi-restoration task correlations, proposing a new IR framework supported by diffusion-based training. To enable IR networks to simultaneously restore images and model generative representations, we introduce a series of regularization strategies that align diffusion objectives with IR tasks, improving generalization in single-task scenarios. Furthermore, recognizing that diffusion-based generation exerts varying influences across different IR tasks, we develop an incremental training paradigm and task-specific adaptors, further enhancing performance in multi-task unified IR. Experiments demonstrate that our method significantly improves the generalization of IR networks in single-task IR and achieves superior performance in multi-task unified IR. Notably, the proposed framework can be seamlessly integrated into existing general IR architectures.

Title: Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis

Authors: Chenqiu Zhao, Anup Basu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21731
Pdf URL: https://arxiv.org/pdf/2506.21731
Copy Paste: [[2506.21731]] Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis(https://arxiv.org/abs/2506.21731)
Keywords: generative
Abstract: We propose two theoretical frameworks, the Mutually Exclusive Probability Space (MESP) and the Local Correlation Hypothesis (LCH), to explore a potential limitation in probabilistic generative models; namely that learning global distributions leads to memorization rather than generative behavior. MESP emerges from our rethinking of the Variational Autoencoder (VAE). We observe that latent variable distributions in VAE exhibit overlap, which leads to an optimization conflict between the reconstruction loss and KL-divergence loss. A lower bound based on the overlap coefficient is proposed. We refer to this phenomenon as Mutually Exclusive Probability Spaces. Based on MESP, a Binary Latent Autoencoder (BL-AE) is proposed to encode images into binary latent representations. These binary latents are used as the input to our Autoregressive Random Variable Model (ARVM), a modified autoregressive model outputting histograms. Our ARVM achieves competitive FID scores, outperforming state-of-the-art methods on standard datasets. However, such scores reflect memorization rather than generation. To address this issue, we propose the Local Correlation Hypothesis (LCH), which posits that generative capability arising from local correlations among latent variables. Comprehensive experiments and discussions are conducted to validate our frameworks.

Title: Equitable Federated Learning with NCA

Authors: Nick Lemke, Mirko Konstantin, Henry John Krumb, John Kalkhof, Jonathan Stieber, Anirban Mukhopadhyay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21735
Pdf URL: https://arxiv.org/pdf/2506.21735
Copy Paste: [[2506.21735]] Equitable Federated Learning with NCA(https://arxiv.org/abs/2506.21735)
Keywords: security, federate, segmentation
Abstract: Federated Learning (FL) is enabling collaborative model training across institutions without sharing sensitive patient data. This approach is particularly valuable in low- and middle-income countries (LMICs), where access to trained medical professionals is limited. However, FL adoption in LMICs faces significant barriers, including limited high-performance computing resources and unreliable internet connectivity. To address these challenges, we introduce FedNCA, a novel FL system tailored for medical image segmentation tasks. FedNCA leverages the lightweight Med-NCA architecture, enabling training on low-cost edge devices, such as widely available smartphones, while minimizing communication costs. Additionally, our encryption-ready FedNCA proves to be suitable for compromised network communication. By overcoming infrastructural and security challenges, FedNCA paves the way for inclusive, efficient, lightweight, and encryption-ready medical imaging solutions, fostering equitable healthcare advancements in resource-constrained regions.

Title: Federated Item Response Theory Models

Authors: Biying Zhou, Nanyu Luo, Feng Ji
Subjects: cs.LG, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2506.21744
Pdf URL: https://arxiv.org/pdf/2506.21744
Copy Paste: [[2506.21744]] Federated Item Response Theory Models(https://arxiv.org/abs/2506.21744)
Keywords: security, privacy, protect, federate
Abstract: Item Response Theory (IRT) models have been widely used to estimate respondents' latent abilities and calibrate items' difficulty. Traditional IRT estimation requires all individual raw response data to be centralized in one place, thus potentially causing privacy issues. Federated learning is an emerging field in computer science and machine learning with added features of privacy protection and distributed computing. To integrate the advances from federated learning with modern psychometrics, we propose a novel framework, Federated Item Response Theory (IRT), to enable estimating traditional IRT models with additional privacy, allowing estimation in a distributed manner without losing estimation accuracy. Our numerical experiments confirm that FedIRT achieves statistical accuracy similar to standard IRT estimation using popular R packages, while offering critical advantages: privacy protection and reduced communication costs. We also validate FedIRT's utility through a real-world exam dataset, demonstrating its effectiveness in realistic educational contexts. This new framework extends IRT's applicability to distributed settings, such as multi-school assessments, without sacrificing accuracy or security. To support practical adoption, we provide an open-ource R package, FedIRT, implementing the framework for the two-parameter logistic (2PL) and partial credit models (PCM).

Title: (Fact) Check Your Bias

Authors: Eivind Morris Bakke, Nora Winger Heggelund
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21745
Pdf URL: https://arxiv.org/pdf/2506.21745
Copy Paste: [[2506.21745]] (Fact) Check Your Bias(https://arxiv.org/abs/2506.21745)
Keywords: large language model
Abstract: Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1's parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as "Not Enough Evidence". Using only its parametric knowledge it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50\% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: this https URL

Title: M3PO: Massively Multi-Task Model-Based Policy Optimization

Authors: Aditya Narendra, Dmitry Makarov, Aleksandr Panov
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21782
Pdf URL: https://arxiv.org/pdf/2506.21782
Copy Paste: [[2506.21782]] M3PO: Massively Multi-Task Model-Based Policy Optimization(https://arxiv.org/abs/2506.21782)
Keywords: robust, generative
Abstract: We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.

Title: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models

Authors: Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21783
Pdf URL: https://arxiv.org/pdf/2506.21783
Copy Paste: [[2506.21783]] Evaluating List Construction and Temporal Understanding capabilities of Large Language Models(https://arxiv.org/abs/2506.21783)
Keywords: generative, large language model
Abstract: Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors on particularly temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering or TLQA benchmark that requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark, requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup and the need to improve retrieval in open-domain setup, providing clear future directions for research on TLQA. The benchmark and code at this https URL.

Title: Multi-task parallelism for robust pre-training of graph foundation models on multi-source, multi-fidelity atomistic modeling data

Authors: Massimiliano Lupo Pasini, Jong Youl Choi, Pei Zhang, Kshitij Mehta, Rylie Weaver, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna Balaprakash
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, physics.atm-clus
Abstract URL: https://arxiv.org/abs/2506.21788
Pdf URL: https://arxiv.org/pdf/2506.21788
Copy Paste: [[2506.21788]] Multi-task parallelism for robust pre-training of graph foundation models on multi-source, multi-fidelity atomistic modeling data(https://arxiv.org/abs/2506.21788)
Keywords: robust
Abstract: Graph foundation models using graph neural networks promise sustainable, efficient atomistic modeling. To tackle challenges of processing multi-source, multi-fidelity data during pre-training, recent studies employ multi-task learning, in which shared message passing layers initially process input atomistic structures regardless of source, then route them to multiple decoding heads that predict data-specific outputs. This approach stabilizes pre-training and enhances a model's transferability to unexplored chemical regions. Preliminary results on approximately four million structures are encouraging, yet questions remain about generalizability to larger, more diverse datasets and scalability on supercomputers. We propose a multi-task parallelism method that distributes each head across computing resources with GPU acceleration. Implemented in the open-source HydraGNN architecture, our method was trained on over 24 million structures from five datasets and tested on the Perlmutter, Aurora, and Frontier supercomputers, demonstrating efficient scaling on all three highly heterogeneous super-computing architectures.

Title: Offensive Language Detection on Social Media Using XLNet

Authors: Reem Alothman, Hafida Benhidour, Said Kerrache
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21795
Pdf URL: https://arxiv.org/pdf/2506.21795
Copy Paste: [[2506.21795]] Offensive Language Detection on Social Media Using XLNet(https://arxiv.org/abs/2506.21795)
Keywords: robust, transformer
Abstract: The widespread use of text-based communication on social media-through chats, comments, and microblogs-has improved user interaction but has also led to an increase in offensive content, including hate speech, racism, and other forms of abuse. Due to the enormous volume of user-generated content, manual moderation is impractical, which creates a need for automated systems that can detect offensive language. Deep learning models, particularly those using transfer learning, have demonstrated significant success in understanding natural language through large-scale pretraining. In this study, we propose an automatic offensive language detection model based on XLNet, a generalized autoregressive pretraining method, and compare its performance with BERT (Bidirectional Encoder Representations from Transformers), which is a widely used baseline in natural language processing (NLP). Both models are evaluated using the Offensive Language Identification Dataset (OLID), a benchmark Twitter dataset that includes hierarchical annotations. Our experimental results show that XLNet outperforms BERT in detecting offensive content and in categorizing the types of offenses, while BERT performs slightly better in identifying the targets of the offenses. Additionally, we find that oversampling and undersampling strategies are effective in addressing class imbalance and improving classification performance. These findings highlight the potential of transfer learning and XLNet-based architectures to create robust systems for detecting offensive language on social media platforms.

Title: Towards Transparent AI: A Survey on Explainable Large Language Models

Authors: Avash Palikhe, Zhenyu Yu, Zichong Wang, Wenbin Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21812
Pdf URL: https://arxiv.org/pdf/2506.21812
Copy Paste: [[2506.21812]] Towards Transparent AI: A Survey on Explainable Large Language Models(https://arxiv.org/abs/2506.21812)
Keywords: interpretability, explainability, transformer, large language model
Abstract: Large Language Models (LLMs) have played a pivotal role in advancing Artificial Intelligence (AI). However, despite their achievements, LLMs often struggle to explain their decision-making processes, making them a 'black box' and presenting a substantial challenge to explainability. This lack of transparency poses a significant obstacle to the adoption of LLMs in high-stakes domain applications, where interpretability is particularly essential. To overcome these limitations, researchers have developed various explainable artificial intelligence (XAI) methods that provide human-interpretable explanations for LLMs. However, a systematic understanding of these methods remains limited. To address this gap, this survey provides a comprehensive review of explainability techniques by categorizing XAI methods based on the underlying transformer architectures of LLMs: encoder-only, decoder-only, and encoder-decoder models. Then these techniques are examined in terms of their evaluation for assessing explainability, and the survey further explores how these explanations are leveraged in practical applications. Finally, it discusses available resources, ongoing research challenges, and future directions, aiming to guide continued efforts toward developing transparent and responsible LLMs.

Title: CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery

Authors: Felix Holm, Gözde Ünver, Ghazal Ghazaei, Nassir Navab
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21813
Pdf URL: https://arxiv.org/pdf/2506.21813
Copy Paste: [[2506.21813]] CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery(https://arxiv.org/abs/2506.21813)
Keywords: segmentation
Abstract: Understanding the intricate workflows of cataract surgery requires modeling complex interactions between surgical tools, anatomical structures, and procedural techniques. Existing datasets primarily address isolated aspects of surgical analysis, such as tool detection or phase segmentation, but lack comprehensive representations that capture the semantic relationships between entities over time. This paper introduces the Cataract Surgery Scene Graph (CAT-SG) dataset, the first to provide structured annotations of tool-tissue interactions, procedural variations, and temporal dependencies. By incorporating detailed semantic relations, CAT-SG offers a holistic view of surgical workflows, enabling more accurate recognition of surgical phases and techniques. Additionally, we present a novel scene graph generation model, CatSGG, which outperforms current methods in generating structured surgical representations. The CAT-SG dataset is designed to enhance AI-driven surgical training, real-time decision support, and workflow analysis, paving the way for more intelligent, context-aware systems in clinical practice.

Title: Exploring the Structure of AI-Induced Language Change in Scientific English

Authors: Riley Galpin, Bryce Anderson, Tom S. Juzek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21817
Pdf URL: https://arxiv.org/pdf/2506.21817
Copy Paste: [[2506.21817]] Exploring the Structure of AI-Induced Language Change in Scientific English(https://arxiv.org/abs/2506.21817)
Keywords: large language model
Abstract: Scientific English has undergone rapid and unprecedented changes in recent years, with words such as "delve," "intricate," and "crucial" showing significant spikes in frequency since around 2022. These changes are widely attributed to the growing influence of Large Language Models like ChatGPT in the discourse surrounding bias and misalignment. However, apart from changes in frequency, the exact structure of these linguistic shifts has remained unclear. The present study addresses this and investigates whether these changes involve the replacement of synonyms by suddenly 'spiking words,' for example, "crucial" replacing "essential" and "key," or whether they reflect broader semantic and pragmatic qualifications. To further investigate structural changes, we include part of speech tagging in our analysis to quantify linguistic shifts over grammatical categories and differentiate between word forms, like "potential" as a noun vs. as an adjective. We systematically analyze synonym groups for widely discussed 'spiking words' based on frequency trends in scientific abstracts from PubMed. We find that entire semantic clusters often shift together, with most or all words in a group increasing in usage. This pattern suggests that changes induced by Large Language Models are primarily semantic and pragmatic rather than purely lexical. Notably, the adjective "important" shows a significant decline, which prompted us to systematically analyze decreasing lexical items. Our analysis of "collapsing" words reveals a more complex picture, which is consistent with organic language change and contrasts with the patterns of the abrupt spikes. These insights into the structure of language change contribute to our understanding of how language technology continues to shape human language.

Title: Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models

Authors: Rafael Sterzinger, Marco Peer, Robert Sablatnig
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21826
Pdf URL: https://arxiv.org/pdf/2506.21826
Copy Paste: [[2506.21826]] Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models(https://arxiv.org/abs/2506.21826)
Keywords: segmentation
Abstract: As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- & 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: this https URL.

Title: TaleForge: Interactive Multimodal System for Personalized Story Creation

Authors: Minh-Loi Nguyen, Quang-Khai Le, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21832
Pdf URL: https://arxiv.org/pdf/2506.21832
Copy Paste: [[2506.21832]] TaleForge: Interactive Multimodal System for Personalized Story Creation(https://arxiv.org/abs/2506.21832)
Keywords: diffusion, large language model
Abstract: Storytelling is a deeply personal and creative process, yet existing methods often treat users as passive consumers, offering generic plots with limited personalization. This undermines engagement and immersion, especially where individual style or appearance is crucial. We introduce TaleForge, a personalized story-generation system that integrates large language models (LLMs) and text-to-image diffusion to embed users' facial images within both narratives and illustrations. TaleForge features three interconnected modules: Story Generation, where LLMs create narratives and character descriptions from user prompts; Personalized Image Generation, merging users' faces and outfit choices into character illustrations; and Background Generation, creating scene backdrops that incorporate personalized characters. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists. Participants praised the system's real-time previews and intuitive controls, though they requested finer narrative editing tools. TaleForge advances multimodal storytelling by aligning personalized text and imagery to create immersive, user-centric experiences.

Title: PrefPaint: Enhancing Image Inpainting through Expert Human Feedback

Authors: Duy-Bao Bui, Hoang-Khang Nguyen, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21834
Pdf URL: https://arxiv.org/pdf/2506.21834
Copy Paste: [[2506.21834]] PrefPaint: Enhancing Image Inpainting through Expert Human Feedback(https://arxiv.org/abs/2506.21834)
Keywords: diffusion
Abstract: Inpainting, the process of filling missing or corrupted image parts, has broad applications, including medical imaging. However, in specialized fields like medical polyps imaging, where accuracy and reliability are critical, inpainting models can generate inaccurate images, leading to significant errors in medical diagnosis and treatment. To ensure reliability, medical images should be annotated by experts like oncologists for effective model training. We propose PrefPaint, an approach that incorporates human feedback into the training process of Stable Diffusion Inpainting, bypassing the need for computationally expensive reward models. In addition, we develop a web-based interface streamlines training, fine-tuning, and inference. This interactive interface provides a smooth and intuitive user experience, making it easier to offer feedback and manage the fine-tuning process. User study on various domains shows that PrefPaint outperforms existing methods, reducing visual inconsistencies and improving image rendering, particularly in medical contexts, where our model generates more realistic polyps images.

Title: ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts

Authors: Xiaoqi Wang, Clint Sebastian, Wenbin He, Liu Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21835
Pdf URL: https://arxiv.org/pdf/2506.21835
Copy Paste: [[2506.21835]] ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts(https://arxiv.org/abs/2506.21835)
Keywords: robust, segmentation
Abstract: The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at object boundaries due to suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5$^i$ and COCO-20$^i$ datasets, providing a more robust solution for visual reference segmentation.

Title: PARSI: Persian Authorship Recognition via Stylometric Integration

Authors: Kourosh Shahnazari, Mohammadali Keshtparvar, Seyed Moein Ayyoubzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21840
Pdf URL: https://arxiv.org/pdf/2506.21840
Copy Paste: [[2506.21840]] PARSI: Persian Authorship Recognition via Stylometric Integration(https://arxiv.org/abs/2506.21840)
Keywords: transformer, generative
Abstract: The intricate linguistic, stylistic, and metrical aspects of Persian classical poetry pose a challenge for computational authorship attribution. In this work, we present a versatile framework to determine authorship among 67 prominent poets. We employ a multi-input neural framework consisting of a transformer-based language encoder complemented by features addressing the semantic, stylometric, and metrical dimensions of Persian poetry. Our feature set encompasses 100-dimensional Word2Vec embeddings, seven stylometric measures, and categorical encodings of poetic form and meter. We compiled a vast corpus of 647,653 verses of the Ganjoor digital collection, validating the data through strict preprocessing and author verification while preserving poem-level splitting to prevent overlap. This work employs verse-level classification and majority and weighted voting schemes in evaluation, revealing that weighted voting yields 71% accuracy. We further investigate threshold-based decision filtering, allowing the model to generate highly confident predictions, achieving 97% accuracy at a 0.9 threshold, though at lower coverage. Our work focuses on the integration of deep representational forms with domain-specific features for improved authorship attribution. The results illustrate the potential of our approach for automated classification and the contribution to stylistic analysis, authorship disputes, and general computational literature research. This research will facilitate further research on multilingual author attribution, style shift, and generative modeling of Persian poetry.

Title: 3D-Telepathy: Reconstructing 3D Objects from EEG Signals

Authors: Yuxiang Ge, Jionghao Cheng, Ruiquan Ge, Zhaojie Fang, Gangyong Jia, Xiang Wan, Nannan Li, Ahmed Elazab, Changmiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21843
Pdf URL: https://arxiv.org/pdf/2506.21843
Copy Paste: [[2506.21843]] 3D-Telepathy: Reconstructing 3D Objects from EEG Signals(https://arxiv.org/abs/2506.21843)
Keywords: extraction, diffusion
Abstract: Reconstructing 3D visual stimuli from Electroencephalography (EEG) data holds significant potential for applications in Brain-Computer Interfaces (BCIs) and aiding individuals with communication disorders. Traditionally, efforts have focused on converting brain activity into 2D images, neglecting the translation of EEG data into 3D objects. This limitation is noteworthy, as the human brain inherently processes three-dimensional spatial information regardless of whether observing 2D images or the real world. The neural activities captured by EEG contain rich spatial information that is inevitably lost when reconstructing only 2D images, thus limiting its practical applications in BCI. The transition from EEG data to 3D object reconstruction faces considerable obstacles. These include the presence of extensive noise within EEG signals and a scarcity of datasets that include both EEG and 3D information, which complicates the extraction process of 3D visual data. Addressing this challenging task, we propose an innovative EEG encoder architecture that integrates a dual self-attention mechanism. We use a hybrid training strategy to train the EEG Encoder, which includes cross-attention, contrastive learning, and self-supervised learning techniques. Additionally, by employing stable diffusion as a prior distribution and utilizing Variational Score Distillation to train a neural radiation field, we successfully generate 3D objects with similar content and structure from EEG data.

Title: LinguaSynth: Heterogeneous Linguistic Signals for News Classification

Authors: Duo Zhang, Junyi Mo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21848
Pdf URL: https://arxiv.org/pdf/2506.21848
Copy Paste: [[2506.21848]] LinguaSynth: Heterogeneous Linguistic Signals for News Classification(https://arxiv.org/abs/2506.21848)
Keywords: robust, interpretability, transformer
Abstract: Deep learning has significantly advanced NLP, but its reliance on large black-box models introduces critical interpretability and computational efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types: lexical, syntactic, entity-level, word-level semantics, and document-level semantics within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.

Title: The Consistency Hypothesis in Uncertainty Quantification for Large Language Models

Authors: Quan Xiao, Debarun Bhattacharjya, Balaji Ganesan, Radu Marinescu, Katsiaryna Mirylenka, Nhan H Pham, Michael Glass, Junkyu Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21849
Pdf URL: https://arxiv.org/pdf/2506.21849
Copy Paste: [[2506.21849]] The Consistency Hypothesis in Uncertainty Quantification for Large Language Models(https://arxiv.org/abs/2506.21849)
Keywords: data-free, large language model
Abstract: Estimating the confidence of large language model (LLM) outputs is essential for real-world applications requiring high user trust. Black-box uncertainty quantification (UQ) methods, relying solely on model API access, have gained popularity due to their practical benefits. In this paper, we examine the implicit assumption behind several UQ methods, which use generation consistency as a proxy for confidence, an idea we formalize as the consistency hypothesis. We introduce three mathematical statements with corresponding statistical tests to capture variations of this hypothesis and metrics to evaluate LLM output conformity across tasks. Our empirical investigation, spanning 8 benchmark datasets and 3 tasks (question answering, text summarization, and text-to-SQL), highlights the prevalence of the hypothesis under different settings. Among the statements, we highlight the `Sim-Any' hypothesis as the most actionable, and demonstrate how it can be leveraged by proposing data-free black-box UQ methods that aggregate similarities between generations for confidence estimation. These approaches can outperform the closest baselines, showcasing the practical value of the empirically observed consistency hypothesis.

Title: End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model

Authors: Haofeng Wang, Fangtao Zhou, Qi Zhang, Zeyuan Chen, Enci Zhang, Zhao Wang, Xiaofeng Huang, Siwei Ma
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2506.21851
Pdf URL: https://arxiv.org/pdf/2506.21851
Copy Paste: [[2506.21851]] End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model(https://arxiv.org/abs/2506.21851)
Keywords: extraction
Abstract: RGB-IR(RGB-Infrared) image pairs are frequently applied simultaneously in various applications like intelligent surveillance. However, as the number of modalities increases, the required data storage and transmission costs also double. Therefore, efficient RGB-IR data compression is essential. This work proposes a joint compression framework for RGB-IR image pair. Specifically, to fully utilize cross-modality prior information for accurate context probability modeling within and between modalities, we propose a Channel-wise Cross-modality Entropy Model (CCEM). Among CCEM, a Low-frequency Context Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are designed for extracting and aggregating the global low-frequency information from both modalities, which assist the model in predicting entropy parameters more accurately. Experimental results demonstrate that our approach outperforms existing RGB-IR image pair and single-modality compression methods on LLVIP and KAIST datasets. For instance, the proposed framework achieves a 23.1% bit rate saving on LLVIP dataset compared to the state-of-the-art RGB-IR image codec presented at CVPR 2022.

Title: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Authors: Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
Subjects: cs.CV, cs.AI, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2506.21862
Pdf URL: https://arxiv.org/pdf/2506.21862
Copy Paste: [[2506.21862]] LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs(https://arxiv.org/abs/2506.21862)
Keywords: large language model
Abstract: In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: this https URL.

Title: DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

Authors: Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21864
Pdf URL: https://arxiv.org/pdf/2506.21864
Copy Paste: [[2506.21864]] DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE(https://arxiv.org/abs/2506.21864)
Keywords: large language model
Abstract: Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at this https URL.

Title: Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images

Authors: Yanguang Sun, Jiexi Yan, Jianjun Qian, Chunyan Xu, Jian Yang, Lei Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21866
Pdf URL: https://arxiv.org/pdf/2506.21866
Copy Paste: [[2506.21866]] Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images(https://arxiv.org/abs/2506.21866)
Keywords: transformer, segmentation
Abstract: Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is valuable research, but it presents several challenges, including the heterogeneity between the two types of features, high complexity, and large parameters of the model. However, these issues are often overlooked in existing the ORSIs methods, causing sub-optimal segmentation. For that, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design the global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase the expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strength features at different layers. Consequently, the DPU-Former model outperforms the state-of-the-art methods on multiple datasets. Code: this https URL.

Title: Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning

Authors: Tzu-Chun Chien, Chieh-Kai Lin, Shiang-Feng Tsai, Ruei-Chi Lai, Hung-Jen Chen, Min Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21873
Pdf URL: https://arxiv.org/pdf/2506.21873
Copy Paste: [[2506.21873]] Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning(https://arxiv.org/abs/2506.21873)
Keywords: large language model
Abstract: Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual grounding, establishing themselves as a general interface for various vision-language applications. This progress has driven the development of token pruning methods to mitigate the high computational costs associated with processing numerous visual tokens. However, we observe that pruning significantly weakens the model's grounding ability, leading to incorrect predictions and drastic performance degradation. In Referring Expression Comprehension (REC), for instance, pruning causes the accuracy of LLaVA on the RefCOCO validation set to drop from 56.14% to 15.34%. Our analysis identifies misaligned position IDs after pruning as the primary cause of this degradation, as both the order and value of these IDs are crucial for maintaining performance in grounding tasks. To address this issue, we propose Grounding-Aware Token Pruning (GAP), a simple yet effective adjustment to position IDs that recovers REC accuracy back to 51.42%, which is 90% of the original performance in the without pruning setting, all while requiring no additional training, memory, or computational overhead. Applied to models such as Shikra, MiniGPTv2, and the LLaVA series, our method consistently improves performance across various token pruning strategies.

Title: On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling

Authors: Stanley Wu, Ronik Bhaskar, Anna Yoo Jeong Ha, Shawn Shan, Haitao Zheng, Ben Y. Zhao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21874
Pdf URL: https://arxiv.org/pdf/2506.21874
Copy Paste: [[2506.21874]] On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling(https://arxiv.org/abs/2506.21874)
Keywords: defense, attack, steal, generative
Abstract: Today's text-to-image generative models are trained on millions of images sourced from the Internet, each paired with a detailed caption produced by Vision-Language Models (VLMs). This part of the training pipeline is critical for supplying the models with large volumes of high-quality image-caption pairs during training. However, recent work suggests that VLMs are vulnerable to stealthy adversarial attacks, where adversarial perturbations are added to images to mislead the VLMs into producing incorrect captions. In this paper, we explore the feasibility of adversarial mislabeling attacks on VLMs as a mechanism to poisoning training pipelines for text-to-image models. Our experiments demonstrate that VLMs are highly vulnerable to adversarial perturbations, allowing attackers to produce benign-looking images that are consistently miscaptioned by the VLM models. This has the effect of injecting strong "dirty-label" poison samples into the training pipeline for text-to-image models, successfully altering their behavior with a small number of poisoned samples. We find that while potential defenses can be effective, they can be targeted and circumvented by adaptive attackers. This suggests a cat-and-mouse game that is likely to reduce the quality of training data and increase the cost of text-to-image model development. Finally, we demonstrate the real-world effectiveness of these attacks, achieving high attack success (over 73%) even in black-box scenarios against commercial VLMs (Google Vertex AI and Microsoft Azure).

Title: WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

Authors: Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21875
Pdf URL: https://arxiv.org/pdf/2506.21875
Copy Paste: [[2506.21875]] WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation(https://arxiv.org/abs/2506.21875)
Keywords: large language model
Abstract: Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.

Title: A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs

Authors: Sean Kim, Hyuhng Joon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21881
Pdf URL: https://arxiv.org/pdf/2506.21881
Copy Paste: [[2506.21881]] A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs(https://arxiv.org/abs/2506.21881)
Keywords: large language model
Abstract: As large language models (LLMs) are increasingly deployed across diverse linguistic and cultural contexts, understanding their behavior in both factual and disputable scenarios is essential, especially when their outputs may shape public opinion or reinforce dominant narratives. In this paper, we define two types of bias in LLMs: model bias (bias stemming from model training) and inference bias (bias induced by the language of the query), through a two-phase evaluation. Phase 1 evaluates LLMs on factual questions where a single verifiable answer exists, assessing whether models maintain consistency across different query languages. Phase 2 expands the scope by probing geopolitically sensitive disputes, where responses may reflect culturally embedded or ideologically aligned perspectives. We construct a manually curated dataset spanning both factual and disputable QA, across four languages and question types. The results show that Phase 1 exhibits query language induced alignment, while Phase 2 reflects an interplay between the model's training context and query language. This paper offers a structured framework for evaluating LLM behavior across neutral and sensitive topics, providing insights for future LLM deployment and culturally aware evaluation practices in multilingual contexts.

Title: GRASP-PsONet: Gradient-based Removal of Spurious Patterns for PsOriasis Severity Classification

Authors: Basudha Pal, Sharif Amit Kamran, Brendon Lutnick, Molly Lucas, Chaitanya Parmar, Asha Patel Shah, David Apfel, Steven Fakharzadeh, Lloyd Miller, Gabriela Cula, Kristopher Standish
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21883
Pdf URL: https://arxiv.org/pdf/2506.21883
Copy Paste: [[2506.21883]] GRASP-PsONet: Gradient-based Removal of Spurious Patterns for PsOriasis Severity Classification(https://arxiv.org/abs/2506.21883)
Keywords: robust, interpretability
Abstract: Psoriasis (PsO) severity scoring is important for clinical trials but is hindered by inter-rater variability and the burden of in person clinical evaluation. Remote imaging using patient captured mobile photos offers scalability but introduces challenges, such as variation in lighting, background, and device quality that are often imperceptible to humans but can impact model performance. These factors, along with inconsistencies in dermatologist annotations, reduce the reliability of automated severity scoring. We propose a framework to automatically flag problematic training images that introduce spurious correlations which degrade model generalization, using a gradient based interpretability approach. By tracing the gradients of misclassified validation images, we detect training samples where model errors align with inconsistently rated examples or are affected by subtle, nonclinical artifacts. We apply this method to a ConvNeXT based weakly supervised model designed to classify PsO severity from phone images. Removing 8.2% of flagged images improves model AUC-ROC by 5% (85% to 90%) on a held out test set. Commonly, multiple annotators and an adjudication process ensure annotation accuracy, which is expensive and time consuming. Our method detects training images with annotation inconsistencies, potentially removing the need for manual review. When applied to a subset of training data rated by two dermatologists, the method identifies over 90% of cases with inter-rater disagreement by reviewing only the top 30% of samples. This improves automated scoring for remote assessments, ensuring robustness despite data collection variability.

Title: Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles

Authors: Chuheng Wei, Ziye Qin, Ziyan Zhang, Guoyuan Wu, Matthew J. Barth
Subjects: cs.CV, cs.MM, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21885
Pdf URL: https://arxiv.org/pdf/2506.21885
Copy Paste: [[2506.21885]] Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles(https://arxiv.org/abs/2506.21885)
Keywords: robust, large language model
Abstract: Multi-sensor fusion plays a critical role in enhancing perception for autonomous driving, overcoming individual sensor limitations, and enabling comprehensive environmental understanding. This paper first formalizes multi-sensor fusion strategies into data-level, feature-level, and decision-level categories and then provides a systematic review of deep learning-based methods corresponding to each strategy. We present key multi-modal datasets and discuss their applicability in addressing real-world challenges, particularly in adverse weather conditions and complex urban environments. Additionally, we explore emerging trends, including the integration of Vision-Language Models (VLMs), Large Language Models (LLMs), and the role of sensor fusion in end-to-end autonomous driving, highlighting its potential to enhance system adaptability and robustness. Our work offers valuable insights into current methods and future directions for multi-sensor fusion in autonomous driving.

Title: DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025

Authors: Umihiro Kamoto, Tatsuya Ishibashi, Noriyuki Kugo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21891
Pdf URL: https://arxiv.org/pdf/2506.21891
Copy Paste: [[2506.21891]] DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025(https://arxiv.org/abs/2506.21891)
Keywords: robust
Abstract: In this report, we present the winning solution that achieved the 1st place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set, securing the top position among all participants. This report details our methodology and provides a comprehensive analysis of the experimental results, demonstrating the effectiveness of our iterative reasoning framework in achieving robust video question answering. The code is available at this https URL

Title: Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning

Authors: Fangling Jiang, Qi Li, Weining Wang, Gang Wang, Bing Liu, Zhenan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21895
Pdf URL: https://arxiv.org/pdf/2506.21895
Copy Paste: [[2506.21895]] Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning(https://arxiv.org/abs/2506.21895)
Keywords: attack, interpretability, large language model
Abstract: Recently the emergence of novel presentation attacks has drawn increasing attention to face anti-spoofing. However, existing methods tend to memorize data patterns from the training set, resulting in poor generalization to unknown attack types across different scenarios and limited interpretability. To address these challenges, this paper presents a reinforcement fine-tuning-based face anti-spoofing method that stimulates the capabilities of multimodal large language models to think and learn how to solve the anti-spoofing task itself, rather than relying on the memorization of authenticity patterns. We design verifiable class consistent reward and reasoning consistent reward, and employ a GRPO-based optimization strategy to guide the model in exploring reasoning policies from multiple perspectives to maximize expected rewards. As a result, through iterative trial-and-error learning while retaining only high-reward trajectories, the model distills highly generalizable decision-making rules from the extensive solution space to effectively address cross-domain face anti-spoofing tasks. Extensive experimental results demonstrate that our method achieves state-of-the-art cross-domain generalization performance. It generalizes well to diverse unknown attack types in unseen target domains while providing interpretable reasoning for its authenticity decisions without requiring labor-intensive textual annotations for training.

Title: One Video to Steal Them All: 3D-Printing IP Theft through Optical Side-Channels

Authors: Twisha Chattopadhyay, Fabricio Ceschin, Marco E. Garza, Dymytriy Zyunkin, Animesh Chhotaray, Aaron P. Stebner, Saman Zonouz, Raheem Beyah
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.21897
Pdf URL: https://arxiv.org/pdf/2506.21897
Copy Paste: [[2506.21897]] One Video to Steal Them All: 3D-Printing IP Theft through Optical Side-Channels(https://arxiv.org/abs/2506.21897)
Keywords: defense, attack, steal
Abstract: The 3D printing industry is rapidly growing and increasingly adopted across various sectors including manufacturing, healthcare, and defense. However, the operational setup often involves hazardous environments, necessitating remote monitoring through cameras and other sensors, which opens the door to cyber-based attacks. In this paper, we show that an adversary with access to video recordings of the 3D printing process can reverse engineer the underlying 3D print instructions. Our model tracks the printer nozzle movements during the printing process and maps the corresponding trajectory into G-code instructions. Further, it identifies the correct parameters such as feed rate and extrusion rate, enabling successful intellectual property theft. To validate this, we design an equivalence checker that quantitatively compares two sets of 3D print instructions, evaluating their similarity in producing objects alike in shape, external appearance, and internal structure. Unlike simple distance-based metrics such as normalized mean square error, our equivalence checker is both rotationally and translationally invariant, accounting for shifts in the base position of the reverse engineered instructions caused by different camera positions. Our model achieves an average accuracy of 90.87 percent and generates 30.20 percent fewer instructions compared to existing methods, which often produce faulty or inaccurate prints. Finally, we demonstrate a fully functional counterfeit object generated by reverse engineering 3D print instructions from video.

Title: TOAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments

Authors: Sheng Yun, Jianhua Pei, Ping Wang
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2506.21900
Pdf URL: https://arxiv.org/pdf/2506.21900
Copy Paste: [[2506.21900]] TOAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments(https://arxiv.org/abs/2506.21900)
Keywords: robust, diffusion, transformer
Abstract: The evolution toward 6G networks demands a fundamental shift from bit-centric transmission to semantic-aware communication that emphasizes task-relevant information. This work introduces TOAST (Task-Oriented Adaptive Semantic Transmission), a unified framework designed to address the core challenge of multi-task optimization in dynamic wireless environments through three complementary components. First, we formulate adaptive task balancing as a Markov decision process, employing deep reinforcement learning to dynamically adjust the trade-off between image reconstruction fidelity and semantic classification accuracy based on real-time channel conditions. Second, we integrate module-specific Low-Rank Adaptation (LoRA) mechanisms throughout our Swin Transformer-based joint source-channel coding architecture, enabling parameter-efficient fine-tuning that dramatically reduces adaptation overhead while maintaining full performance across diverse channel impairments including Additive White Gaussian Noise (AWGN), fading, phase noise, and impulse interference. Third, we incorporate an Elucidating diffusion model that operates in the latent space to restore features corrupted by channel noises, providing substantial quality improvements compared to baseline approaches. Extensive experiments across multiple datasets demonstrate that TOAST achieves superior performance compared to baseline approaches, with significant improvements in both classification accuracy and reconstruction quality at low Signal-to-Noise Ratio (SNR) conditions while maintaining robust performance across all tested scenarios.

Title: RAUM-Net: Regional Attention and Uncertainty-aware Mamba Network

Authors: Mingquan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21905
Pdf URL: https://arxiv.org/pdf/2506.21905
Copy Paste: [[2506.21905]] RAUM-Net: Regional Attention and Uncertainty-aware Mamba Network(https://arxiv.org/abs/2506.21905)
Keywords: robust
Abstract: Fine Grained Visual Categorization (FGVC) remains a challenging task in computer vision due to subtle inter class differences and fragile feature representations. Existing methods struggle in fine grained scenarios, especially when labeled data is scarce. We propose a semi supervised method combining Mamba based feature modeling, region attention, and Bayesian uncertainty. Our approach enhances local to global feature modeling while focusing on key areas during learning. Bayesian inference selects high quality pseudo labels for stability. Experiments show strong performance on FGVC benchmarks with occlusions, demonstrating robustness when labeled data is limited. Code is available at this https URL Net.

Title: Consumer Beware! Exploring Data Brokers' CCPA Compliance

Authors: Elina van Kempen, Isita Bagayatkar, Pavel Frolikov, Chloe Georgiou, Gene Tsudik
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21914
Pdf URL: https://arxiv.org/pdf/2506.21914
Copy Paste: [[2506.21914]] Consumer Beware! Exploring Data Brokers' CCPA Compliance(https://arxiv.org/abs/2506.21914)
Keywords: privacy, protect
Abstract: Data brokers collect and sell the personal information of millions of individuals, often without their knowledge or consent. The California Consumer Privacy Act (CCPA) grants consumers the legal right to request access to, or deletion of, their data. To facilitate these requests, California maintains an official registry of data brokers. However, the extent to which these entities comply with the law is unclear. This paper presents the first large-scale, systematic study of CCPA compliance of all 543 officially registered data brokers. Data access requests were manually submitted to each broker, followed by in-depth analyses of their responses (or lack thereof). Above 40% failed to respond at all, in an apparent violation of the CCPA. Data brokers that responded requested personal information as part of their identity verification process, including details they had not previously collected. Paradoxically, this means that exercising one's privacy rights under CCPA introduces new privacy risks. Our findings reveal rampant non-compliance and lack of standardization of the data access request process. These issues highlight an urgent need for stronger enforcement, clearer guidelines, and standardized, periodic compliance checks to enhance consumers' privacy protections and improve data broker accountability.

Title: SepFormer: Coarse-to-fine Separator Regression Network for Table Structure Recognition

Authors: Nam Quan Nguyen, Xuan Phong Pham, Tuan-Anh Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21920
Pdf URL: https://arxiv.org/pdf/2506.21920
Copy Paste: [[2506.21920]] SepFormer: Coarse-to-fine Separator Regression Network for Table Structure Recognition(https://arxiv.org/abs/2506.21920)
Keywords: robust, extraction, transformer
Abstract: The automated reconstruction of the logical arrangement of tables from image data, termed Table Structure Recognition (TSR), is fundamental for semantic data extraction. Recently, researchers have explored a wide range of techniques to tackle this problem, demonstrating significant progress. Each table is a set of vertical and horizontal separators. Following this realization, we present SepFormer, which integrates the split-and-merge paradigm into a single step through separator regression with a DETR-style architecture, improving speed and robustness. SepFormer is a coarse-to-fine approach that predicts table separators from single-line to line-strip separators with a stack of two transformer decoders. In the coarse-grained stage, the model learns to gradually refine single-line segments through decoder layers with additional angle loss. At the end of the fine-grained stage, the model predicts line-strip separators by refining sampled points from each single-line segment. Our SepFormer can run on average at 25.6 FPS while achieving comparable performance with state-of-the-art methods on several benchmark datasets, including SciTSR, PubTabNet, WTW, and iFLYTAB.

Title: SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

Authors: Zhao Jin, Rong-Cheng Tu, Jingyi Liao, Wenhao Sun, Xiao Luo, Shunyu Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21924
Pdf URL: https://arxiv.org/pdf/2506.21924
Copy Paste: [[2506.21924]] SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding(https://arxiv.org/abs/2506.21924)
Keywords: robust
Abstract: 3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this, anchor-guided candidate screening is conducted to perform a coarse-level localization of potential objects. Furthermore, leveraging retrieved relevant 2D camera images, 3D-2D joint decision-making is efficiently performed to determine the best-matching object. By bridging spatial and semantic reasoning neural streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving notable gains of 9.0% and 10.9% in accuracy.

Title: HQCM-EBTC: A Hybrid Quantum-Classical Model for Explainable Brain Tumor Classification

Authors: Marwan Ait Haddou, Mohamed Bennai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21937
Pdf URL: https://arxiv.org/pdf/2506.21937
Copy Paste: [[2506.21937]] HQCM-EBTC: A Hybrid Quantum-Classical Model for Explainable Brain Tumor Classification(https://arxiv.org/abs/2506.21937)
Keywords: interpretability
Abstract: We propose HQCM-EBTC, a hybrid quantum-classical model for automated brain tumor classification using MRI images. Trained on a dataset of 7,576 scans covering normal, meningioma, glioma, and pituitary classes, HQCM-EBTC integrates a 5-qubit, depth-2 quantum layer with 5 parallel circuits, optimized via AdamW and a composite loss blending cross-entropy and attention consistency. HQCM-EBTC achieves 96.48% accuracy, substantially outperforming the classical baseline (86.72%). It delivers higher precision and F1-scores, especially for glioma detection. t-SNE projections reveal enhanced feature separability in quantum space, and confusion matrices show lower misclassification. Attention map analysis (Jaccard Index) confirms more accurate and focused tumor localization at high-confidence thresholds. These results highlight the promise of quantum-enhanced models in medical imaging, advancing both diagnostic accuracy and interpretability for clinical brain tumor assessment.

Title: GuiderNet: A Meta-Learning Framework for Optimizing Quantum Circuit Geometry and Mitigating Barren Plateaus

Authors: Marwan Ait Haddou, Mohamed Bennai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21940
Pdf URL: https://arxiv.org/pdf/2506.21940
Copy Paste: [[2506.21940]] GuiderNet: A Meta-Learning Framework for Optimizing Quantum Circuit Geometry and Mitigating Barren Plateaus(https://arxiv.org/abs/2506.21940)
Keywords: robust
Abstract: Variational Quantum Algorithms (VQAs) offer potential for near-term quantum advantage but face challenges from barren plateaus, where gradients vanish, and poorly conditioned optimization landscapes. We introduce GuiderNet, a meta-learning framework that conditions Parameterized Quantum Circuits (PQCs) using data-dependent parameter shifts aimed at minimizing the log condition number of the Fubini-Study metric tensor. Implemented as a classical neural network, GuiderNet is meta-trained to guide PQC parameters into geometrically favorable regions and is embedded within hybrid quantum-classical pipelines to steer both initialization and adaptive modulation during training. Applied to the Kaggle Diabetes classification task, GuiderNet reduces cumulative training loss by over 5x, improves test accuracy from 75.3% to 98.6%, and increases the minority-class F1 score from 0.67 to 0.95. It also suppresses gradient explosion and stabilizes parameter updates, enabling smoother and more robust optimization. These results demonstrate that geometric meta-conditioning can mitigate barren plateaus and ill-conditioning, providing a scalable approach to enhance trainability and generalization in quantum machine learning.

Title: SDRNET: Stacked Deep Residual Network for Accurate Semantic Segmentation of Fine-Resolution Remotely Sensed Images

Authors: Naftaly Wambugu, Ruisheng Wang, Bo Guo, Tianshu Yu, Sheng Xu, Mohammed Elhassan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21945
Pdf URL: https://arxiv.org/pdf/2506.21945
Copy Paste: [[2506.21945]] SDRNET: Stacked Deep Residual Network for Accurate Semantic Segmentation of Fine-Resolution Remotely Sensed Images(https://arxiv.org/abs/2506.21945)
Keywords: robust, segmentation
Abstract: Land cover maps generated from semantic segmentation of high-resolution remotely sensed images have drawn mucon in the photogrammetry and remote sensing research community. Currently, massive fine-resolution remotely sensed (FRRS) images acquired by improving sensing and imaging technologies become available. However, accurate semantic segmentation of such FRRS images is greatly affected by substantial class disparities, the invisibility of key ground objects due to occlusion, and object size variation. Despite the extraordinary potential in deep convolutional neural networks (DCNNs) in image feature learning and representation, extracting sufficient features from FRRS images for accurate semantic segmentation is still challenging. These challenges demand the deep learning models to learn robust features and generate sufficient feature descriptors. Specifically, learning multi-contextual features to guarantee adequate coverage of varied object sizes from the ground scene and harnessing global-local contexts to overcome class disparities challenge even profound networks. Deeper networks significantly lose spatial details due to gradual downsampling processes resulting in poor segmentation results and coarse boundaries. This article presents a stacked deep residual network (SDRNet) for semantic segmentation from FRRS images. The proposed framework utilizes two stacked encoder-decoder networks to harness long-range semantics yet preserve spatial information and dilated residual blocks (DRB) between each encoder and decoder network to capture sufficient global dependencies thus improving segmentation performance. Our experimental results obtained using the ISPRS Vaihingen and Potsdam datasets demonstrate that the SDRNet performs effectively and competitively against current DCNNs in semantic segmentation.

Title: Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications

Authors: Yangyang Wan, Haotian Wang, Xuhui Yu, Jiageng Chen, Xinyu Fan, Zuyuan He
Subjects: cs.LG, physics.app-ph, physics.optics
Abstract URL: https://arxiv.org/abs/2506.21952
Pdf URL: https://arxiv.org/pdf/2506.21952
Copy Paste: [[2506.21952]] Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications(https://arxiv.org/abs/2506.21952)
Keywords: generative
Abstract: Distributed acoustic sensing (DAS) has attracted considerable attention across various fields and artificial intelligence (AI) technology plays an important role in DAS applications to realize event recognition and denoising. Existing AI models require real-world data (RWD), whether labeled or not, for training, which is contradictory to the fact of limited available event data in real-world scenarios. Here, a physics-informed DAS neural network paradigm is proposed, which does not need real-world events data for training. By physically modeling target events and the constraints of real world and DAS system, physical functions are derived to train a generative network for generation of DAS events data. DAS debackground net is trained by using the generated DAS events data to eliminate background noise in DAS data. The effectiveness of the proposed paradigm is verified in event identification application based on a public dataset of DAS spatiotemporal data and in belt conveyor fault monitoring application based on DAS time-frequency data, and achieved comparable or better performance than data-driven networks trained with RWD. Owing to the introduction of physical information and capability of background noise removal, the paradigm demonstrates generalization in same application on different sites. A fault diagnosis accuracy of 91.8% is achieved in belt conveyor field with networks which transferred from simulation test site without any fault events data of test site and field for training. The proposed paradigm is a prospective solution to address significant obstacles of data acquisition and intense noise in practical DAS applications and explore more potential fields for DAS.

Title: Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement

Authors: Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, Peng Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21956
Pdf URL: https://arxiv.org/pdf/2506.21956
Copy Paste: [[2506.21956]] Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement(https://arxiv.org/abs/2506.21956)
Keywords: transformer, generative
Abstract: In the realm of online advertising, advertisers partake in ad auctions to obtain advertising slots, frequently taking advantage of auto-bidding tools provided by demand-side platforms. To improve the automation of these bidding systems, we adopt generative models, namely the Decision Transformer (DT), to tackle the difficulties inherent in automated bidding. Applying the Decision Transformer to the auto-bidding task enables a unified approach to sequential modeling, which efficiently overcomes short-sightedness by capturing long-term dependencies between past bidding actions and user behavior. Nevertheless, conventional DT has certain drawbacks: (1) DT necessitates a preset return-to-go (RTG) value before generating actions, which is not inherently produced; (2) The policy learned by DT is restricted by its training data, which is consists of mixed-quality trajectories. To address these challenges, we introduce the R* Decision Transformer (R* DT), developed in a three-step process: (1) R DT: Similar to traditional DT, R DT stores actions based on state and RTG value, as well as memorizing the RTG for a given state using the training set; (2) R^ DT: We forecast the highest value (within the training set) of RTG for a given state, deriving a suboptimal policy based on the current state and the forecasted supreme RTG value; (3) R* DT: Based on R^ DT, we generate trajectories and select those with high rewards (using a simulator) to augment our training dataset. This data enhancement has been shown to improve the RTG of trajectories in the training data and gradually leads the suboptimal policy towards optimality. Comprehensive tests on a publicly available bidding dataset validate the R* DT's efficacy and highlight its superiority when dealing with mixed-quality trajectories.

Title: Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding

Authors: Yixin Zha, Chuxin Wang, Wenfei Yang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21957
Pdf URL: https://arxiv.org/pdf/2506.21957
Copy Paste: [[2506.21957]] Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding(https://arxiv.org/abs/2506.21957)
Keywords: robust
Abstract: Point cloud understanding aims to acquire robust and general feature representations from unlabeled data. Masked point modeling-based methods have recently shown significant performance across various downstream tasks. These pre-training methods rely on random masking strategies to establish the perception of point clouds by restoring corrupted point cloud inputs, which leads to the failure of capturing reasonable semantic relationships by the self-supervised models. To address this issue, we propose Semantic Masked Autoencoder, which comprises two main components: a prototype-based component semantic modeling module and a component semantic-enhanced masking strategy. Specifically, in the component semantic modeling module, we design a component semantic guidance mechanism to direct a set of learnable prototypes in capturing the semantics of different components from objects. Leveraging these prototypes, we develop a component semantic-enhanced masking strategy that addresses the limitations of random masking in effectively covering complete component structures. Furthermore, we introduce a component semantic-enhanced prompt-tuning strategy, which further leverages these prototypes to improve the performance of pre-trained models in downstream tasks. Extensive experiments conducted on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed modules.

Title: PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory

Authors: Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21961
Pdf URL: https://arxiv.org/pdf/2506.21961
Copy Paste: [[2506.21961]] PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory(https://arxiv.org/abs/2506.21961)
Keywords: large language model
Abstract: Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs' decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at this https URL.

Title: More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

Authors: Weimin Xiong, Ke Wang, Yifan Song, Hanchao Liu, Sai Zhou, Wei Peng, Sujian Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21967
Pdf URL: https://arxiv.org/pdf/2506.21967
Copy Paste: [[2506.21967]] More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents(https://arxiv.org/abs/2506.21967)
Keywords: attack
Abstract: Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage evaluation while neglecting their stability. This limits their real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool's response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage and agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing the model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. This highlights the importance of evaluating agent stability and offers valuable insights for future LLM development and evaluation.

Title: Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Authors: Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21972
Pdf URL: https://arxiv.org/pdf/2506.21972
Copy Paste: [[2506.21972]] Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses(https://arxiv.org/abs/2506.21972)
Keywords: defense, attack, robust, large language model
Abstract: The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.

Title: Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

Authors: Simon Münker, Nils Schwager, Achim Rettinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21974
Pdf URL: https://arxiv.org/pdf/2506.21974
Copy Paste: [[2506.21974]] Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism(https://arxiv.org/abs/2506.21974)
Keywords: generative, large language model
Abstract: The ability of Large Language Models (LLMs) to mimic human behavior triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with the use of LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitate user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.

Title: TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models

Authors: Meng Yu, Te Cui, Qitong Chu, Wenjie Song, Yi Yang, Yufeng Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21975
Pdf URL: https://arxiv.org/pdf/2506.21975
Copy Paste: [[2506.21975]] TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models(https://arxiv.org/abs/2506.21975)
Keywords: transformer, segmentation
Abstract: Reliable semantic segmentation of open environments is essential for intelligent systems, yet significant problems remain: 1) Existing RGB-T semantic segmentation models mainly rely on low-level visual features and lack high-level textual information, which struggle with accurate segmentation when categories share similar visual characteristics. 2) While SAM excels in instance-level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. To address these, we propose TASeg, a text-aware RGB-T segmentation framework by using Low-Rank Adaptation (LoRA) fine-tuning technology to adapt vision foundation models. Specifically, we propose a Dynamic Feature Fusion Module (DFFM) in the image encoder, which effectively merges features from multiple visual modalities while freezing SAM's original transformer blocks. Additionally, we incorporate CLIP-generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies the classification error and improves the semantic understanding accuracy. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters.

Title: SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model

Authors: Shuhan Tan, John Lambert, Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang
Subjects: cs.LG, cs.AI, cs.CV, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21976
Pdf URL: https://arxiv.org/pdf/2506.21976
Copy Paste: [[2506.21976]] SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model(https://arxiv.org/abs/2506.21976)
Keywords: generative
Abstract: The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. We demonstrate the city-scale traffic simulation capability of SceneDiffuser++ and study its superior realism under long simulation conditions. We evaluate the simulation quality on an augmented version of the Waymo Open Motion Dataset (WOMD) with larger map regions to support trip-level simulation.

Title: R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning

Authors: Biao Wang, Wenwen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21980
Pdf URL: https://arxiv.org/pdf/2506.21980
Copy Paste: [[2506.21980]] R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning(https://arxiv.org/abs/2506.21980)
Keywords: large language model
Abstract: Visual single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames, given only its initial state in the first frame. This task has traditionally been framed as a template matching problem, evolving through major phases including correlation filters, two-stream networks, and one-stream networks with significant progress achieved. However, these methods typically require explicit classification and regression modeling, depend on supervised training with large-scale datasets, and are limited to the single task of tracking, lacking flexibility. In recent years, multi-modal large language models (MLLMs) have advanced rapidly. Open-source models like Qwen2.5-VL, a flagship MLLMs with strong foundational capabilities, demonstrate excellent performance in grounding tasks. This has spurred interest in applying such models directly to visual tracking. However, experiments reveal that Qwen2.5-VL struggles with template matching between image pairs (i.e., tracking tasks). Inspired by deepseek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method on a small-scale dataset with a rule-based reward function. The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark. R1-Track supports flexible initialization via bounding boxes or text descriptions while retaining most of the original model's general capabilities. And we further discuss potential improvements for R1-Track. This rough technical report summarizes our findings as of May 2025.

Title: Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit

Authors: Kartheek Kumar Reddy Nareddy, Sarah Ternus, Julia Niebling
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21990
Pdf URL: https://arxiv.org/pdf/2506.21990
Copy Paste: [[2506.21990]] Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit(https://arxiv.org/abs/2506.21990)
Keywords: transformer
Abstract: The developments in transformer encoder-decoder architectures have led to significant breakthroughs in machine translation, Automatic Speech Recognition (ASR), and instruction-based chat machines, among other applications. The pre-trained models were trained on vast amounts of generic data over a few epochs (fewer than five in most cases), resulting in their strong generalization capabilities. Nevertheless, the performance of these models does suffer when applied to niche domains like transcribing pilot speech in the cockpit, which involves a lot of specific vocabulary and multilingual conversations. This paper investigates and improves the transcription accuracy of cockpit conversations with Whisper models. We have collected around 85 minutes of cockpit simulator recordings and 130 minutes of interview recordings with pilots and manually labeled them. The speakers are middle aged men speaking both German and English. To improve the accuracy of transcriptions, we propose multiple normalization schemes to refine the transcripts and improve Word Error Rate (WER). We then employ fine-tuning to enhance ASR performance, utilizing performance-efficient fine-tuning with Low-Rank Adaptation (LoRA). Hereby, WER decreased from 68.49 \% (pretrained whisper Large model without normalization baseline) to 26.26\% (finetuned whisper Large model with the proposed normalization scheme).

Title: RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, Abhinav Valada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22007
Pdf URL: https://arxiv.org/pdf/2506.22007
Copy Paste: [[2506.22007]] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation(https://arxiv.org/abs/2506.22007)
Keywords: diffusion
Abstract: We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each of the two generated frames, achieving the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.

Title: Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method

Authors: Han Wang, Shengyang Li, Jian Yang, Yuxuan Liu, Yixuan Lv, Zhuang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22027
Pdf URL: https://arxiv.org/pdf/2506.22027
Copy Paste: [[2506.22027]] Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method(https://arxiv.org/abs/2506.22027)
Keywords: transformer
Abstract: Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning to pre-train on large-scale optical-SAR image pairs, ensuring the model's ability to extract modality-invariant features. Our dataset and baseline method are publicly available on this https URL.

Title: Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation

Authors: Jialei Chen, Xu Zheng, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22032
Pdf URL: https://arxiv.org/pdf/2506.22032
Copy Paste: [[2506.22032]] Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation(https://arxiv.org/abs/2506.22032)
Keywords: segmentation
Abstract: Zero-shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation-based methods, distillation-based approaches transfer vision-language alignment of vision-language model, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision-based features with the textual space, which requires combining spatial precision with vision-language alignment; and (2) the semantic gap between CLIP's global representations and the local, fine-grained features of segmentation models. To address challenge (1), we propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision-language alignment. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. The CSH incorporates a frozen subnetwork and fixed projection layers from the CLIP visual encoder, along with lightweight trainable components. The partial module from CLIP visual encoder, paired with the segmentation model, retains segmentation capability while easing the mapping to CLIP's semantic space. To address challenge (2), we propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token, while gradually reducing the number of features used for alignment as training progresses. Besides, we also use a Semantic Alignment Module (SAM) to further align dense visual features with semantic embeddings extracted from the frozen CLIP text encoder. Experiments on two benchmarks show improvements of 0.9% and 1.2% in hIoU.

Title: Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion

Authors: Ying Zhang, Yu Zhao, Xuhui Sui, Baohang Zhou, Xiangrui Cai, Li Shen, Xiaojie Yuan, Dacheng Tao
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.22036
Pdf URL: https://arxiv.org/pdf/2506.22036
Copy Paste: [[2506.22036]] Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion(https://arxiv.org/abs/2506.22036)
Keywords: robust, federate, diffusion
Abstract: With the increasing multimodal knowledge privatization requirements, multimodal knowledge graphs in different institutes are usually decentralized, lacking of effective collaboration system with both stronger reasoning ability and transmission safety guarantees. In this paper, we propose the Federated Multimodal Knowledge Graph Completion (FedMKGC) task, aiming at training over federated MKGs for better predicting the missing links in clients without sharing sensitive knowledge. We propose a framework named MMFeD3-HidE for addressing multimodal uncertain unavailability and multimodal client heterogeneity challenges of FedMKGC. (1) Inside the clients, our proposed Hyper-modal Imputation Diffusion Embedding model (HidE) recovers the complete multimodal distributions from incomplete entity embeddings constrained by available modalities. (2) Among clients, our proposed Multimodal FeDerated Dual Distillation (MMFeD3) transfers knowledge mutually between clients and the server with logit and feature distillation to improve both global convergence and semantic consistency. We propose a FedMKGC benchmark for a comprehensive evaluation, consisting of a general FedMKGC backbone named MMFedE, datasets with heterogeneous multimodal information, and three groups of constructed baselines. Experiments conducted on our benchmark validate the effectiveness, semantic consistency, and convergence robustness of MMFeD3-HidE.

Title: Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children's Literature Translation

Authors: Delu Kong, Lieve Macken
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22038
Pdf URL: https://arxiv.org/pdf/2506.22038
Copy Paste: [[2506.22038]] Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children's Literature Translation(https://arxiv.org/abs/2506.22038)
Keywords: large language model
Abstract: This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in English-to-Chinese children's literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhythm, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs show significant variation in descriptive words usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics, demonstrating the potential of LLMs in CLT.

Title: Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field

Authors: Hong Nie, Fuyuan Cao, Lu Chen, Fengxin Chen, Yuefeng Zou, Jun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22044
Pdf URL: https://arxiv.org/pdf/2506.22044
Copy Paste: [[2506.22044]] Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field(https://arxiv.org/abs/2506.22044)
Keywords: generative
Abstract: Reconstruction and rendering-based talking head synthesis methods achieve high-quality results with strong identity preservation but are limited by their dependence on identity-specific models. Each new identity requires training from scratch, incurring high computational costs and reduced scalability compared to generative model-based approaches. To overcome this limitation, we propose FIAG, a novel 3D speaking head synthesis framework that enables efficient identity-specific adaptation using only a few training footage. FIAG incorporates Global Gaussian Field, which supports the representation of multiple identities within a shared field, and Universal Motion Field, which captures the common motion dynamics across diverse identities. Benefiting from the shared facial structure information encoded in the Global Gaussian Field and the general motion priors learned in the motion field, our framework enables rapid adaptation from canonical identity representations to specific ones with minimal data. Extensive comparative and ablation experiments demonstrate that our method outperforms existing state-of-the-art approaches, validating both the effectiveness and generalizability of the proposed framework. Code is available at: \textit{this https URL}.

Title: GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Authors: Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Yin Lu, Can Yang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22049
Pdf URL: https://arxiv.org/pdf/2506.22049
Copy Paste: [[2506.22049]] GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling(https://arxiv.org/abs/2506.22049)
Keywords: transformer, large language model
Abstract: Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the residual path to dominate over sub-layer outputs and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings.

Title: Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs

Authors: Delu Kong, Lieve Macken
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22050
Pdf URL: https://arxiv.org/pdf/2506.22050
Copy Paste: [[2506.22050]] Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs(https://arxiv.org/abs/2506.22050)
Keywords: large language model
Abstract: This study explores Machine Translationese (MTese) -- the linguistic peculiarities of machine translation outputs -- focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.

Title: Lost at the Beginning of Reasoning

Authors: Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22058
Pdf URL: https://arxiv.org/pdf/2506.22058
Copy Paste: [[2506.22058]] Lost at the Beginning of Reasoning(https://arxiv.org/abs/2506.22058)
Keywords: robust, large language model
Abstract: Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored. And recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction - errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across two state-of-the-art open-source reasoning model families: DeepSeek-R1 and Qwen3. To address this, we propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Finally, we introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities, offering a foundation for future research on robust reasoning in LLMs.

Title: MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

Authors: Dechao Meng, Steven Xiao, Xindi Zhang, Guangyuan Wang, Peng Zhang, Qi Wang, Bang Zhang, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22065
Pdf URL: https://arxiv.org/pdf/2506.22065
Copy Paste: [[2506.22065]] MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation(https://arxiv.org/abs/2506.22065)
Keywords: diffusion, transformer
Abstract: Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: 1. A reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; 2. A causal audio encoder and adapter tailored to LTX's temporal structure, enabling precise audio-expression synchronization; and 3. A progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand pose integration for enhanced gesture control. Extensive experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.

Title: Transformers are Graph Neural Networks

Authors: Chaitanya K. Joshi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22084
Pdf URL: https://arxiv.org/pdf/2506.22084
Copy Paste: [[2506.22084]] Transformers are Graph Neural Networks(https://arxiv.org/abs/2506.22084)
Keywords: transformer
Abstract: We establish connections between the Transformer architecture, originally introduced for natural language processing, and Graph Neural Networks (GNNs) for representation learning on graphs. We show how Transformers can be viewed as message passing GNNs operating on fully connected graphs of tokens, where the self-attention mechanism capture the relative importance of all tokens w.r.t. each-other, and positional encodings provide hints about sequential ordering or structure. Thus, Transformers are expressive set processing networks that learn relationships among input elements without being constrained by apriori graphs. Despite this mathematical connection to GNNs, Transformers are implemented via dense matrix operations that are significantly more efficient on modern hardware than sparse message passing. This leads to the perspective that Transformers are GNNs currently winning the hardware lottery.

Title: Tied Prototype Model for Few-Shot Medical Image Segmentation

Authors: Hyeongji Kim, Stine Hansen, Michael Kampffmeyer
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22101
Pdf URL: https://arxiv.org/pdf/2506.22101
Copy Paste: [[2506.22101]] Tied Prototype Model for Few-Shot Medical Image Segmentation(https://arxiv.org/abs/2506.22101)
Keywords: segmentation
Abstract: Common prototype-based medical image few-shot segmentation (FSS) methods model foreground and background classes using class-specific prototypes. However, given the high variability of the background, a more promising direction is to focus solely on foreground modeling, treating the background as an anomaly -- an approach introduced by ADNet. Yet, ADNet faces three key limitations: dependence on a single prototype per class, a focus on binary classification, and fixed thresholds that fail to adapt to patient and organ variability. To address these shortcomings, we propose the Tied Prototype Model (TPM), a principled reformulation of ADNet with tied prototype locations for foreground and background distributions. Building on its probabilistic foundation, TPM naturally extends to multiple prototypes and multi-class segmentation while effectively separating non-typical background features. Notably, both extensions lead to improved segmentation accuracy. Finally, we leverage naturally occurring class priors to define an ideal target for adaptive thresholds, boosting segmentation performance. Taken together, TPM provides a fresh perspective on prototype-based FSS for medical image segmentation. The code can be found at this https URL.

Title: Pedestrian Intention and Trajectory Prediction in Unstructured Traffic Using IDD-PeD

Authors: Ruthvik Bokkasam, Shankar Gangisetty, A. H. Abdul Hafez, C. V. Jawahar
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2506.22111
Pdf URL: https://arxiv.org/pdf/2506.22111
Copy Paste: [[2506.22111]] Pedestrian Intention and Trajectory Prediction in Unstructured Traffic Using IDD-PeD(https://arxiv.org/abs/2506.22111)
Keywords: robust
Abstract: With the rapid advancements in autonomous driving, accurately predicting pedestrian behavior has become essential for ensuring safety in complex and unpredictable traffic conditions. The growing interest in this challenge highlights the need for comprehensive datasets that capture unstructured environments, enabling the development of more robust prediction models to enhance pedestrian safety and vehicle navigation. In this paper, we introduce an Indian driving pedestrian dataset designed to address the complexities of modeling pedestrian behavior in unstructured environments, such as illumination changes, occlusion of pedestrians, unsignalized scene types and vehicle-pedestrian interactions. The dataset provides high-level and detailed low-level comprehensive annotations focused on pedestrians requiring the ego-vehicle's attention. Evaluation of the state-of-the-art intention prediction methods on our dataset shows a significant performance drop of up to $\mathbf{15\%}$, while trajectory prediction methods underperform with an increase of up to $\mathbf{1208}$ MSE, defeating standard pedestrian datasets. Additionally, we present exhaustive quantitative and qualitative analysis of intention and trajectory baselines. We believe that our dataset will open new challenges for the pedestrian behavior research community to build robust models. Project Page: this https URL

Title: Low-Rank Implicit Neural Representation via Schatten-p Quasi-Norm and Jacobian Regularization

Authors: Zhengyun Cheng, Changhao Wang, Guanwen Zhang, Yi Xu, Wei Zhou, Xiangyang Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22134
Pdf URL: https://arxiv.org/pdf/2506.22134
Copy Paste: [[2506.22134]] Low-Rank Implicit Neural Representation via Schatten-p Quasi-Norm and Jacobian Regularization(https://arxiv.org/abs/2506.22134)
Keywords: interpretability
Abstract: Higher-order tensors are well-suited for representing multi-dimensional data, such as color images and videos. Low-rank tensor representation has become essential in machine learning and computer vision, but existing methods like Tucker decomposition offer flexibility at the expense of interpretability. In contrast, while the CANDECOMP/PARAFAC (CP) decomposition provides a more natural and interpretable tensor structure, obtaining sparse solutions remains challenging. Leveraging the rich properties of CP decomposition, we propose a CP-based low-rank tensor function parameterized by neural networks for implicit neural representation (CP-INR). This approach enables continuous data representation beyond structured grids, fully exploiting the non-linearity of tensor data with theoretical guarantees on excess risk bounds. To achieve a sparse CP decomposition, we introduce a variational form of the Schatten-p quasi-norm and prove its relationship to multilinear rank minimization. For smoothness, we propose a regularization term based on the spectral norm of the Jacobian and Hutchinson's trace estimator. Our proposed smoothness regularization is SVD-free and avoids explicit chain rule derivations. It can serve as an alternative to Total Variation (TV) regularization in image denoising tasks and is naturally applicable to continuous data. Extensive experiments on multi-dimensional data recovery tasks, including image inpainting, denoising, and point cloud upsampling, demonstrate the superiority and versatility of our method compared to state-of-the-art approaches.

Title: Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Authors: Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22139
Pdf URL: https://arxiv.org/pdf/2506.22139
Copy Paste: [[2506.22139]] Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs(https://arxiv.org/abs/2506.22139)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.

Title: RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models

Authors: Ronald Fecso, José Morano, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22149
Pdf URL: https://arxiv.org/pdf/2506.22149
Copy Paste: [[2506.22149]] RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models(https://arxiv.org/abs/2506.22149)
Keywords: robust
Abstract: The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline retinal disease staging. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.8, 3.9, and 2.1 percentage points over their baselines, respectively. Our code and model weights are publicly available at this https URL.

Title: Training Language Model to Critique for Better Refinement

Authors: Tianshu Yu, Chao Xiang, Mingchuan Yang, Pei Ke, Bosi Wen, Cunxiang Wang, Jiale Cheng, Li Zhang, Xinyu Mu, Chuxiong Sun, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22157
Pdf URL: https://arxiv.org/pdf/2506.22157
Copy Paste: [[2506.22157]] Training Language Model to Critique for Better Refinement(https://arxiv.org/abs/2506.22157)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce \textbf{R}efinement-oriented \textbf{C}ritique \textbf{O}ptimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method's effectiveness in enhancing LLM critique-refinement loops.

Title: Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition

Authors: Wenhan Wu, Zhishuai Guo, Chen Chen, Hongfei Xue, Aidong Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22179
Pdf URL: https://arxiv.org/pdf/2506.22179
Copy Paste: [[2506.22179]] Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition(https://arxiv.org/abs/2506.22179)
Keywords: robust
Abstract: Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, improving zero-shot action recognition.

Title: Reliability Analysis of Smart Contract Execution Architectures: A Comparative Simulation Study

Authors: Önder Gürcan
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2506.22180
Pdf URL: https://arxiv.org/pdf/2506.22180
Copy Paste: [[2506.22180]] Reliability Analysis of Smart Contract Execution Architectures: A Comparative Simulation Study(https://arxiv.org/abs/2506.22180)
Keywords: secure, security
Abstract: The industrial market continuously needs reliable solutions to secure autonomous systems. Especially as these systems become more complex and interconnected, reliable security solutions are becoming increasingly important. One promising solution to tackle this challenge is using smart contracts designed to meet contractual conditions, avoid malicious errors, secure exchanges, and minimize the need for reliable intermediaries. However, smart contracts are immutable. Moreover, there are different smart contract execution architectures (namely Order-Execute and Execute-Order-Validate) that have different throughputs. In this study, we developed an evaluation model for assessing the security of reliable smart contract execution. We then developed a realistic smart contract enabled IoT energy case study. Finally, we simulate the developed case study to evaluate several smart contract security vulnerabilities reported in the literature. Our results show that the Execute-Order-Validate architecture is more promising regarding reliability and security.

Title: Exploring Modularity of Agentic Systems for Drug Discovery

Authors: Laura van Weesep, Samuel Genheden, Ola Engkvist, Jens Sjölund
Subjects: cs.LG, cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2506.22189
Pdf URL: https://arxiv.org/pdf/2506.22189
Copy Paste: [[2506.22189]] Exploring Modularity of Agentic Systems for Drug Discovery(https://arxiv.org/abs/2506.22189)
Keywords: large language model
Abstract: Large-language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery and design. In this study, we critically examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the agentic system such as the LLM are interchangeable, a topic that has received limited attention in drug discovery applications. We compare the performance of different large language models (LLMs) and the effectiveness of tool-calling agents versus code-generating agents in this domain. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform the tool-calling ones on average, we show that this is highly question and model dependent. Furthermore, the impact of replacing system prompts is dependent on the specific question asked and the model used, underscoring that -- even in this particular domain -- one cannot just replace language models without considering prompt re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of stable and scalable solutions for real-world problems.

Title: dreaMLearning: Data Compression Assisted Machine Learning

Authors: Xiaobo Zhao, Aaron Hurst, Panagiotis Karras, Daniel E. Lucani
Subjects: cs.LG, cs.IT, eess.SP
Abstract URL: https://arxiv.org/abs/2506.22190
Pdf URL: https://arxiv.org/pdf/2506.22190
Copy Paste: [[2506.22190]] dreaMLearning: Data Compression Assisted Machine Learning(https://arxiv.org/abs/2506.22190)
Keywords: federate
Abstract: Despite rapid advancements, machine learning, particularly deep learning, is hindered by the need for large amounts of labeled data to learn meaningful patterns without overfitting and immense demands for computation and storage, which motivate research into architectures that can achieve good performance with fewer resources. This paper introduces dreaMLearning, a novel framework that enables learning from compressed data without decompression, built upon Entropy-based Generalized Deduplication (EntroGeDe), an entropy-driven lossless compression method that consolidates information into a compact set of representative samples. DreaMLearning accommodates a wide range of data types, tasks, and model architectures. Extensive experiments on regression and classification tasks with tabular and image data demonstrate that dreaMLearning accelerates training by up to 8.8x, reduces memory usage by 10x, and cuts storage by 42%, with a minimal impact on model performance. These advancements enhance diverse ML applications, including distributed and federated learning, and tinyML on resource-constrained edge devices, unlocking new possibilities for efficient and scalable learning.

Title: Robust and Accurate Multi-view 2D/3D Image Registration with Differentiable X-ray Rendering and Dual Cross-view Constraints

Authors: Yuxin Cui, Rui Song, Yibin Li, Max Q.-H. Meng, Zhe Min
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.22191
Pdf URL: https://arxiv.org/pdf/2506.22191
Copy Paste: [[2506.22191]] Robust and Accurate Multi-view 2D/3D Image Registration with Differentiable X-ray Rendering and Dual Cross-view Constraints(https://arxiv.org/abs/2506.22191)
Keywords: robust
Abstract: Robust and accurate 2D/3D registration, which aligns preoperative models with intraoperative images of the same anatomy, is crucial for successful interventional navigation. To mitigate the challenge of a limited field of view in single-image intraoperative scenarios, multi-view 2D/3D registration is required by leveraging multiple intraoperative images. In this paper, we propose a novel multi-view 2D/3D rigid registration approach comprising two stages. In the first stage, a combined loss function is designed, incorporating both the differences between predicted and ground-truth poses and the dissimilarities (e.g., normalized cross-correlation) between simulated and observed intraoperative images. More importantly, additional cross-view training loss terms are introduced for both pose and image losses to explicitly enforce cross-view constraints. In the second stage, test-time optimization is performed to refine the estimated poses from the coarse stage. Our method exploits the mutual constraints of multi-view projection poses to enhance the robustness of the registration process. The proposed framework achieves a mean target registration error (mTRE) of $0.79 \pm 2.17$ mm on six specimens from the DeepFluoro dataset, demonstrating superior performance compared to state-of-the-art registration algorithms.

Title: EFRame: Deeper Reasoning via Exploration-Filtering-Replay Reinforcement Learning Framework

Authors: Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22200
Pdf URL: https://arxiv.org/pdf/2506.22200
Copy Paste: [[2506.22200]] EFRame: Deeper Reasoning via Exploration-Filtering-Replay Reinforcement Learning Framework(https://arxiv.org/abs/2506.22200)
Keywords: robust, large language model
Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL's computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations, we introduce EFRame, an Exploration-Filtering-Replay framework that systematically augments GRPO along three critical dimensions. EFRame performs additional rollouts to explore high-quality trajectories, applies online filtering to eliminate low-quality samples that introduce noise and variance, and leverages experience replay to repeatedly exploit rare but informative samples. EFRame establishes a complete and stable learning cycle, guiding the model through a structured transition from exploration to convergence. Our experiments across a variety of reasoning benchmarks demonstrate that EFRame not only improves the robustness and efficiency of training, but also enables access to deeper reasoning capabilities that remain unattainable under vanilla GRPO. Furthermore, EFRame enables a more fine-grained categorization of training samples, allowing for a deeper analysis of how different types of samples contribute to the learning process in RL. Our code is available at this https URL.

Title: Boosting Classification with Quantum-Inspired Augmentations

Authors: Matthias Tschöpe, Vitor Fortes Rey, Sogo Pierre Sanon, Paul Lukowicz, Nikolaos Palaiodimopoulos, Maximilian Kiefer-Emmanouilidis
Subjects: cs.CV, cond-mat.dis-nn, cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2506.22241
Pdf URL: https://arxiv.org/pdf/2506.22241
Copy Paste: [[2506.22241]] Boosting Classification with Quantum-Inspired Augmentations(https://arxiv.org/abs/2506.22241)
Keywords: privacy
Abstract: Understanding the impact of small quantum gate perturbations, which are common in quantum digital devices but absent in classical computers, is crucial for identifying potential advantages in quantum machine learning. While these perturbations are typically seen as detrimental to quantum computation, they can actually enhance performance by serving as a natural source of data augmentation. Additionally, they can often be efficiently simulated on classical hardware, enabling quantum-inspired approaches to improve classical machine learning methods. In this paper, we investigate random Bloch sphere rotations, which are fundamental SU(2) transformations, as a simple yet effective quantum-inspired data augmentation technique. Unlike conventional augmentations such as flipping, rotating, or cropping, quantum transformations lack intuitive spatial interpretations, making their application to tasks like image classification less straightforward. While common quantum augmentation methods rely on applying quantum models or trainable quanvolutional layers to classical datasets, we focus on the direct application of small-angle Bloch rotations and their effect on classical data. Using the large-scale ImageNet dataset, we demonstrate that our quantum-inspired augmentation method improves image classification performance, increasing Top-1 accuracy by 3%, Top-5 accuracy by 2.5%, and the F$_1$ score from 8% to 12% compared to standard classical augmentation methods. Finally, we examine the use of stronger unitary augmentations. Although these transformations preserve information in principle, they result in visually unrecognizable images with potential applications for privacy computations. However, we show that our augmentation approach and simple SU(2) transformations do not enhance differential privacy and discuss the implications of this limitation.

Title: Projected Compression: Trainable Projection for Efficient Transformer Compression

Authors: Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Maciej Pióro, Jakub Krajewski, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jan Ludziejewski
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22255
Pdf URL: https://arxiv.org/pdf/2506.22255
Copy Paste: [[2506.22255]] Projected Compression: Trainable Projection for Efficient Transformer Compression(https://arxiv.org/abs/2506.22255)
Keywords: transformer, large language model
Abstract: Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose Projected Compression, a novel model compression technique, that reduces model weights by utilizing projection modules. Specifically, we first train additional trainable projections weights and preserve access to all the original model parameters. Subsequently, these projections are merged into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the base model's per-token computation step in FLOPs. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher quality models. Moreover, the performance margin scales well with the number of tokens.

Title: Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment

Authors: Rui Xu, Yunke Wang, Yong Luo, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22283
Pdf URL: https://arxiv.org/pdf/2506.22283
Copy Paste: [[2506.22283]] Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment(https://arxiv.org/abs/2506.22283)
Keywords: large language model
Abstract: Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLM). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing methods, despite requiring no additional training or complex modifications. Its simple yet effective design enables efficient inference while preserving strong performance across tasks.

Title: OutDreamer: Video Outpainting with a Diffusion Transformer

Authors: Linhao Zhong, Fan Li, Yi Huang, Jianzhuang Liu, Renjing Pei, Fenglong Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22298
Pdf URL: https://arxiv.org/pdf/2506.22298
Copy Paste: [[2506.22298]] OutDreamer: Video Outpainting with a Diffusion Transformer(https://arxiv.org/abs/2506.22298)
Keywords: diffusion, transformer
Abstract: Video outpainting is a challenging task that generates new video content by extending beyond the boundaries of an original input video, requiring both temporal and spatial consistency. Many state-of-the-art methods utilize latent diffusion models with U-Net backbones but still struggle to achieve high quality and adaptability in generated content. Diffusion transformers (DiTs) have emerged as a promising alternative because of their superior performance. We introduce OutDreamer, a DiT-based video outpainting framework comprising two main components: an efficient video control branch and a conditional outpainting branch. The efficient video control branch effectively extracts masked video information, while the conditional outpainting branch generates missing content based on these extracted conditions. Additionally, we propose a mask-driven self-attention layer that dynamically integrates the given mask information, further enhancing the model's adaptability to outpainting tasks. Furthermore, we introduce a latent alignment loss to maintain overall consistency both within and between frames. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content, ensuring temporal consistency across video clips. Extensive evaluations demonstrate that our zero-shot OutDreamer outperforms state-of-the-art zero-shot methods on widely recognized benchmarks.

Title: Weakly-Supervised Domain Adaptation with Proportion-Constrained Pseudo-Labeling

Authors: Takumi Okuo, Shinnosuke Matsuo, Shota Harada, Kiyohito Tanaka, Ryoma Bise
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22301
Pdf URL: https://arxiv.org/pdf/2506.22301
Copy Paste: [[2506.22301]] Weakly-Supervised Domain Adaptation with Proportion-Constrained Pseudo-Labeling(https://arxiv.org/abs/2506.22301)
Keywords: robust
Abstract: Domain shift is a significant challenge in machine learning, particularly in medical applications where data distributions differ across institutions due to variations in data collection practices, equipment, and procedures. This can degrade performance when models trained on source domain data are applied to the target domain. Domain adaptation methods have been widely studied to address this issue, but most struggle when class proportions between the source and target domains differ. In this paper, we propose a weakly-supervised domain adaptation method that leverages class proportion information from the target domain, which is often accessible in medical datasets through prior knowledge or statistical reports. Our method assigns pseudo-labels to the unlabeled target data based on class proportion (called proportion-constrained pseudo-labeling), improving performance without the need for additional annotations. Experiments on two endoscopic datasets demonstrate that our method outperforms semi-supervised domain adaptation techniques, even when 5% of the target domain is labeled. Additionally, the experimental results with noisy proportion labels highlight the robustness of our method, further demonstrating its effectiveness in real-world application scenarios.

Title: Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling

Authors: Erkan Turan, Aristotelis Siozopoulos, Maks Ovsjanikov
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.22304
Pdf URL: https://arxiv.org/pdf/2506.22304
Copy Paste: [[2506.22304]] Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling(https://arxiv.org/abs/2506.22304)
Keywords: diffusion, generative
Abstract: Conditional Flow Matching (CFM) offers a simulation-free framework for training continuous-time generative models, bridging diffusion and flow-based approaches. However, sampling from CFM still relies on numerically solving non-linear ODEs which can be computationally expensive and difficult to interpret. Recent alternatives address sampling speed via trajectory straightening, mini-batch coupling or distillation. However, these methods typically do not shed light on the underlying \textit{structure} of the generative process. In this work, we propose to accelerate CFM and introduce an interpretable representation of its dynamics by integrating Koopman operator theory, which models non-linear flows as linear evolution in a learned space of observables. We introduce a decoder-free Koopman-CFM architecture that learns an embedding where the generative dynamics become linear, enabling closed-form, one-step sampling via matrix exponentiation. This results in significant speedups over traditional CFM as demonstrated on controlled 2D datasets and real-world benchmarks, MNIST, Fashion-MNIST (F-MNIST), and the Toronto Face Dataset (TFD). Unlike previous methods, our approach leads to a well-structured Koopman generator, whose spectral properties, eigenvalues, and eigenfunctions offer principled tools for analyzing generative behavior such as temporal scaling, mode stability, and decomposition in Koopman latent space. By combining sampling efficiency with analytical structure, Koopman-enhanced flow matching offers a potential step toward fast and interpretable generative modeling.

Title: Detection of Personal Data in Structured Datasets Using a Large Language Model

Authors: Albert Agisha Ntwali, Luca Rück, Martin Heckmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22305
Pdf URL: https://arxiv.org/pdf/2506.22305
Copy Paste: [[2506.22305]] Detection of Personal Data in Structured Datasets Using a Large Language Model(https://arxiv.org/abs/2506.22305)
Keywords: large language model
Abstract: We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key innovation of our method is the incorporation of contextual information: in addition to a feature's name and values, we utilize information from other feature names within the dataset as well as the dataset description. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets: DeSSI, a large synthetic dataset, datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings reveal that detection performance varies significantly depending on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. Performance on the medical dataset MIMIC-Demo-Ext is comparable across all models, with our GPT-4o-based approach clearly outperforming the others. Notably, personal data detection in the Kaggle and OpenML datasets appears to benefit from contextual information. This is evidenced by the poor performance of CASSED and Presidio (both of which do not utilize the context of the dataset) compared to the strong results of our GPT-4o-based approach. We conclude that further progress in this field would greatly benefit from the availability of more real-world datasets containing personal information.

Title: Evaluating Scoring Bias in LLM-as-a-Judge

Authors: Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22316
Pdf URL: https://arxiv.org/pdf/2506.22316
Copy Paste: [[2506.22316]] Evaluating Scoring Bias in LLM-as-a-Judge(https://arxiv.org/abs/2506.22316)
Keywords: fair, large language model
Abstract: The remarkable performance of Large Language Models (LLMs) gives rise to``LLM-as-a-Judge'', where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the scores differ when scoring judge models are bias-related perturbed, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.

Title: Under the Hood of BlotchyQuasar: DLL-Based RAT Campaigns Against Latin America

Authors: Alessio Di Santo
Subjects: cs.CR, cs.CY, cs.NI, cs.OS, cs.PL
Abstract URL: https://arxiv.org/abs/2506.22323
Pdf URL: https://arxiv.org/pdf/2506.22323
Copy Paste: [[2506.22323]] Under the Hood of BlotchyQuasar: DLL-Based RAT Campaigns Against Latin America(https://arxiv.org/abs/2506.22323)
Keywords: security, defense, attack, robust, steal
Abstract: A sophisticated malspam campaign was recently uncovered targeting Latin American countries, with a particular focus on Brazil. This operation utilizes a highly deceptive phishing email to trick users into executing a malicious MSI file, initiating a multi-stage infection. The core of the attack leverages DLL side-loading, where a legitimate executable from Valve Corporation is used to load a trojanized DLL, thereby bypassing standard security defenses. Once active, the malware, a variant of QuasarRAT known as BlotchyQuasar, is capable of a wide range of malicious activities. It is designed to steal sensitive browser-stored credentials and banking information, the latter through fake login windows mimicking well-known Brazilian banks. The threat establishes persistence by modifying the Windows registry , captures user keystrokes through keylogging , and exfiltrates stolen data to a Command-and-Control (C2) server using encrypted payloads. Despite its advanced capabilities, the malware code exhibits signs of rushed development, with inefficiencies and poor error handling that suggest the threat actors prioritized rapid deployment over meticulous design. Nonetheless, the campaign extensive reach and sophisticated mechanisms pose a serious and immediate threat to the targeted regions, underscoring the need for robust cybersecurity defenses.

Title: Less Greedy Equivalence Search

Authors: Adiba Ejaz, Elias Bareinboim
Subjects: cs.LG, cs.AI, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22331
Pdf URL: https://arxiv.org/pdf/2506.22331
Copy Paste: [[2506.22331]] Less Greedy Equivalence Search(https://arxiv.org/abs/2506.22331)
Keywords: robust
Abstract: Greedy Equivalence Search (GES) is a classic score-based algorithm for causal discovery from observational data. In the sample limit, it recovers the Markov equivalence class of graphs that describe the data. Still, it faces two challenges in practice: computational cost and finite-sample accuracy. In this paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that retains its theoretical guarantees while partially addressing these limitations. LGES modifies the greedy step: rather than always applying the highest-scoring insertion, it avoids edge insertions between variables for which the score implies some conditional independence. This more targeted search yields up to a $10$-fold speed-up and a substantial reduction in structural error relative to GES. Moreover, LGES can guide the search using prior assumptions, while correcting these assumptions when contradicted by the data. Finally, LGES can exploit interventional data to refine the learned observational equivalence class. We prove that LGES recovers the true equivalence class in the sample limit from observational and interventional data, even with misspecified prior assumptions. Experiments demonstrate that LGES outperforms GES and other baselines in speed, accuracy, and robustness to misspecified assumptions. Our code is available at this https URL.

Title: MatChA: Cross-Algorithm Matching with Feature Augmentation

Authors: Paula Carbó Cubero, Alberto Jaenal Gálvez, André Mateus, José Araújo, Patric Jensfelt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22336
Pdf URL: https://arxiv.org/pdf/2506.22336
Copy Paste: [[2506.22336]] MatChA: Cross-Algorithm Matching with Feature Augmentation(https://arxiv.org/abs/2506.22336)
Keywords: extraction
Abstract: State-of-the-art methods fail to solve visual localization in scenarios where different devices use different sparse feature extraction algorithms to obtain keypoints and their corresponding descriptors. Translating feature descriptors is enough to enable matching. However, performance is drastically reduced in cross-feature detector cases, because current solutions assume common keypoints. This means that the same detector has to be used, which is rarely the case in practice when different descriptors are used. The low repeatability of keypoints, in addition to non-discriminatory and non-distinctive descriptors, make the identification of true correspondences extremely challenging. We present the first method tackling this problem, which performs feature descriptor augmentation targeting cross-detector feature matching, and then feature translation to a latent space. We show that our method significantly improves image matching and visual localization in the cross-feature scenario and evaluate the proposed method on several benchmarks.

Title: A Framework for Multi-source Privacy Preserving Epidemic Analysis

Authors: Zihan Guan, Zhiyuan Zhao, Fengwei Tian, Dung Nguyen, Payel Bhattacharjee, Ravi Tandon, B. Aditya Prakash, Anil Vullikanti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22342
Pdf URL: https://arxiv.org/pdf/2506.22342
Copy Paste: [[2506.22342]] A Framework for Multi-source Privacy Preserving Epidemic Analysis(https://arxiv.org/abs/2506.22342)
Keywords: privacy, protect
Abstract: It is now well understood that diverse datasets provide a lot of value in key epidemiology and public health analyses, such as forecasting and nowcasting, development of epidemic models, evaluation and design of interventions and resource allocation. Some of these datasets are often sensitive, and need adequate privacy protections. There are many models of privacy, but Differential Privacy (DP) has become a de facto standard because of its strong guarantees, without making models about adversaries. In this paper, we develop a framework the integrates deep learning and epidemic models to simultaneously perform epidemic forecasting and learning a mechanistic model of epidemic spread, while incorporating multiple datasets for these analyses, including some with DP guarantees. We demonstrate our framework using a realistic but synthetic financial dataset with DP; such a dataset has not been used in such epidemic analyses. We show that this dataset provides significant value in forecasting and learning an epidemic model, even when used with DP guarantees.

Title: Closing the Performance Gap in Biometric Cryptosystems: A Deeper Analysis on Unlinkable Fuzzy Vaults

Authors: Hans Geißner, Christian Rathgeb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22347
Pdf URL: https://arxiv.org/pdf/2506.22347
Copy Paste: [[2506.22347]] Closing the Performance Gap in Biometric Cryptosystems: A Deeper Analysis on Unlinkable Fuzzy Vaults(https://arxiv.org/abs/2506.22347)
Keywords: protect, biometric
Abstract: This paper analyses and addresses the performance gap in the fuzzy vault-based \ac{BCS}. We identify unstable error correction capabilities, which are caused by variable feature set sizes and their influence on similarity thresholds, as a key source of performance degradation. This issue is further compounded by information loss introduced through feature type transformations. To address both problems, we propose a novel feature quantization method based on \it{equal frequent intervals}. This method guarantees fixed feature set sizes and supports training-free adaptation to any number of intervals. The proposed approach significantly reduces the performance gap introduced by template protection. Additionally, it integrates seamlessly with existing systems to minimize the negative effects of feature transformation. Experiments on state-of-the-art face, fingerprint, and iris recognition systems confirm that only minimal performance degradation remains, demonstrating the effectiveness of the method across major biometric modalities.

Title: From Ground to Air: Noise Robustness in Vision Transformers and CNNs for Event-Based Vehicle Classification with Potential UAV Applications

Authors: Nouf Almesafri, Hector Figueiredo, Miguel Arana-Catania
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22360
Pdf URL: https://arxiv.org/pdf/2506.22360
Copy Paste: [[2506.22360]] From Ground to Air: Noise Robustness in Vision Transformers and CNNs for Event-Based Vehicle Classification with Potential UAV Applications(https://arxiv.org/abs/2506.22360)
Keywords: robust, transformer
Abstract: This study investigates the performance of the two most relevant computer vision deep learning architectures, Convolutional Neural Network and Vision Transformer, for event-based cameras. These cameras capture scene changes, unlike traditional frame-based cameras with capture static images, and are particularly suited for dynamic environments such as UAVs and autonomous vehicles. The deep learning models studied in this work are ResNet34 and ViT B16, fine-tuned on the GEN1 event-based dataset. The research evaluates and compares these models under both standard conditions and in the presence of simulated noise. Initial evaluations on the clean GEN1 dataset reveal that ResNet34 and ViT B16 achieve accuracies of 88% and 86%, respectively, with ResNet34 showing a slight advantage in classification accuracy. However, the ViT B16 model demonstrates notable robustness, particularly given its pre-training on a smaller dataset. Although this study focuses on ground-based vehicle classification, the methodologies and findings hold significant promise for adaptation to UAV contexts, including aerial object classification and event-based vision systems for aviation-related tasks.

Title: Why Are Parsing Actions for Understanding Message Hierarchies Not Random?

Authors: Daichi Kato, Ryo Ueda, Yusuke Miyao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22366
Pdf URL: https://arxiv.org/pdf/2506.22366
Copy Paste: [[2506.22366]] Why Are Parsing Actions for Understanding Message Hierarchies Not Random?(https://arxiv.org/abs/2506.22366)
Keywords: robust
Abstract: If humans understood language by randomly selecting parsing actions, it might have been necessary to construct a robust symbolic system capable of being interpreted under any hierarchical structure. However, human parsing strategies do not seem to follow such a random pattern. Why is that the case? In fact, a previous study on emergent communication using models with hierarchical biases have reported that agents adopting random parsing strategies$\unicode{x2013}$ones that deviate significantly from human language comprehension$\unicode{x2013}$can achieve high communication accuracy. In this study, we investigate this issue by making two simple and natural modifications to the experimental setup: (I) we use more complex inputs that have hierarchical structures, such that random parsing makes semantic interpretation more difficult, and (II) we incorporate a surprisal-related term, which is known to influence the order of words and characters in natural language, into the objective function. With these changes, we evaluate whether agents employing random parsing strategies still maintain high communication accuracy.

Title: Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems

Authors: Abdulmomen Ghalkha, Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22374
Pdf URL: https://arxiv.org/pdf/2506.22374
Copy Paste: [[2506.22374]] Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems(https://arxiv.org/abs/2506.22374)
Keywords: federate
Abstract: In large-scale communication systems, increasingly complex scenarios require more intelligent collaboration among edge devices collecting various multimodal sensory data to achieve a more comprehensive understanding of the environment and improve decision-making accuracy. However, conventional federated learning (FL) algorithms typically consider unimodal datasets, require identical model architectures, and fail to leverage the rich information embedded in multimodal data, limiting their applicability to real-world scenarios with diverse modalities and varying client capabilities. To address this issue, we propose Sheaf-DMFL, a novel decentralized multimodal learning framework leveraging sheaf theory to enhance collaboration among devices with diverse modalities. Specifically, each client has a set of local feature encoders for its different modalities, whose outputs are concatenated before passing through a task-specific layer. While encoders for the same modality are trained collaboratively across clients, we capture the intrinsic correlations among clients' task-specific layers using a sheaf-based structure. To further enhance learning capability, we propose an enhanced algorithm named Sheaf-DMFL-Att, which tailors the attention mechanism within each client to capture correlations among different modalities. A rigorous convergence analysis of Sheaf-DMFL-Att is provided, establishing its theoretical guarantees. Extensive simulations are conducted on real-world link blockage prediction and mmWave beamforming scenarios, demonstrate the superiority of the proposed algorithms in such heterogeneous wireless communication systems.

Title: Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation

Authors: Tiankai Chen, Yushu Li, Adam Goodge, Fei Teng, Xulei Yang, Tianrui Li, Xun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22375
Pdf URL: https://arxiv.org/pdf/2506.22375
Copy Paste: [[2506.22375]] Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation(https://arxiv.org/abs/2506.22375)
Keywords: robust
Abstract: Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for 2D image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhancing the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLM. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods across synthetic and real-world datasets 3D point cloud OOD detection.

Title: Probabilistic Optimality for Inference-time Scaling

Authors: Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei, Qing Li
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22376
Pdf URL: https://arxiv.org/pdf/2506.22376
Copy Paste: [[2506.22376]] Probabilistic Optimality for Inference-time Scaling(https://arxiv.org/abs/2506.22376)
Keywords: large language model
Abstract: Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop \textsc{OptScale}, a practical algorithm that dynamically determines the optimal number of sampled responses. \textsc{OptScale} employs a language model-based predictor to estimate probabilistic prior parameters, enabling the decision of the minimal number of samples needed that satisfy predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that \textsc{OptScale} significantly reduces sampling overhead while remaining better or on par with state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning.

Title: Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment

Authors: Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22385
Pdf URL: https://arxiv.org/pdf/2506.22385
Copy Paste: [[2506.22385]] Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment(https://arxiv.org/abs/2506.22385)
Keywords: generative, large language model
Abstract: Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning-the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For solving the classification task, we propose the Chain of Counterfactual Thought framework, utilizing counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset, with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting our proposed method in enhancing dynamic reasoning capabilities of VLMMs.

Title: Towards Distributed Neural Architectures

Authors: Aditya Cowsik, Tianyu He, Andrey Gromov
Subjects: cs.LG, cond-mat.dis-nn, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22389
Pdf URL: https://arxiv.org/pdf/2506.22389
Copy Paste: [[2506.22389]] Towards Distributed Neural Architectures(https://arxiv.org/abs/2506.22389)
Keywords: transformer
Abstract: We introduce and train distributed neural architectures (DNA) in vision and language domains. DNAs are initialized with a proto-architecture that consists of (transformer, MLP, attention, etc.) modules and routers. Any token (or patch) can traverse any series of modules in any order. DNAs are a natural generalization of the sparse methods such as Mixture-of-Experts, Mixture-of-Depths, parameter sharing, etc. Computation and communication patterns of DNA modules are learnt end-to-end during training and depend on the content and context of each token (or patch). These patterns can be shaped by further requirements added to the optimization objective such as compute/memory efficiency or load balancing. We empirically show that (i) trained DNAs are competitive with the dense baselines in both domains and (ii) compute efficiency/parameter sharing can be learnt from data. Next, we analyze the emergent connectivity and computation patterns in the trained DNAs. We find that the paths that tokens take through the models are themselves distributed according to a power-law. We show that some paths (or, equivalently, groups of modules) show emergent specialization. Finally, we demonstrate that models learn to allocate compute and active parameters in an interpretable way.

Title: Multi-View Contrastive Learning for Robust Domain Adaptation in Medical Time Series Analysis

Authors: YongKyung Oh, Alex Bui
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22393
Pdf URL: https://arxiv.org/pdf/2506.22393
Copy Paste: [[2506.22393]] Multi-View Contrastive Learning for Robust Domain Adaptation in Medical Time Series Analysis(https://arxiv.org/abs/2506.22393)
Keywords: robust
Abstract: Adapting machine learning models to medical time series across different domains remains a challenge due to complex temporal dependencies and dynamic distribution shifts. Current approaches often focus on isolated feature representations, limiting their ability to fully capture the intricate temporal dynamics necessary for robust domain adaptation. In this work, we propose a novel framework leveraging multi-view contrastive learning to integrate temporal patterns, derivative-based dynamics, and frequency-domain features. Our method employs independent encoders and a hierarchical fusion mechanism to learn feature-invariant representations that are transferable across domains while preserving temporal coherence. Extensive experiments on diverse medical datasets, including electroencephalogram (EEG), electrocardiogram (ECG), and electromyography (EMG) demonstrate that our approach significantly outperforms state-of-the-art methods in transfer learning tasks. By advancing the robustness and generalizability of machine learning models, our framework offers a practical pathway for deploying reliable AI systems in diverse healthcare settings.

Title: Test-Time Consistency in Vision Language Models

Authors: Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22395
Pdf URL: https://arxiv.org/pdf/2506.22395
Copy Paste: [[2506.22395]] Test-Time Consistency in Vision Language Models(https://arxiv.org/abs/2506.22395)
Keywords: robust
Abstract: Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks, yet they often exhibit inconsistent behavior when faced with semantically equivalent inputs, undermining their reliability and robustness. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs, despite maintaining high average accuracy. Prior work addresses this issue by modifying model architectures or conducting large-scale fine-tuning on curated datasets. In contrast, we propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training. Our method is entirely post-hoc, model-agnostic, and applicable to any VLM with access to its weights. Given a single test point, we enforce consistent predictions via two complementary objectives: (i) a Cross-Entropy Agreement Loss that aligns predictive distributions across semantically equivalent inputs, and (ii) a Pseudo-Label Consistency Loss that draws outputs toward a self-averaged consensus. Our method is plug-and-play and leverages information from a single test input itself to improve consistency. Experiments on the MM-R3 benchmark show that our framework yields substantial gains in consistency across state-of-the-art models, establishing a new direction for inference-time adaptation in multimodal learning.

Title: QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

Authors: Danush Khanna, Aditya Kumar Guru, Srivarshinee Sridhar, Zidan Ahmed, Rubhav Bahirwani, Meetu Malhotra, Vinija Jain, Aman Chadha, Amitava Das, Kripabandhu Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22396
Pdf URL: https://arxiv.org/pdf/2506.22396
Copy Paste: [[2506.22396]] QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization(https://arxiv.org/abs/2506.22396)
Keywords: large language model
Abstract: Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining, architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (<=0.2).

Title: Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL

Authors: Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2506.22401
Pdf URL: https://arxiv.org/pdf/2506.22401
Copy Paste: [[2506.22401]] Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL(https://arxiv.org/abs/2506.22401)
Keywords: transformer
Abstract: Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we are still in lack of efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation -- it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions.

Title: Refining Czech GEC: Insights from a Multi-Experiment Approach

Authors: Petr Pechman, Milan Straka, Jana Straková, Jakub Náplava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22402
Pdf URL: https://arxiv.org/pdf/2506.22402
Copy Paste: [[2506.22402]] Refining Czech GEC: Insights from a Multi-Experiment Approach(https://arxiv.org/abs/2506.22402)
Keywords: transformer, large language model
Abstract: We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and the trained model links are available on this https URL.

Title: HyperCLOVA X THINK Technical Report

Authors: NAVER Cloud HyperCLOVA X Team
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22403
Pdf URL: https://arxiv.org/pdf/2506.22403
Copy Paste: [[2506.22403]] HyperCLOVA X THINK Technical Report(https://arxiv.org/abs/2506.22403)
Keywords: robust, transformer, large language model
Abstract: We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly $6$ trillion high-quality Korean, and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with $\mu$P, pre-trained through a three-stage curriculum that expands the context window to $128$K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards supports both detailed rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark, all of which are achieved with substantially lower training compute than existing models of similar sizes. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.

Title: ARMOR: Robust Reinforcement Learning-based Control for UAVs under Physical Attacks

Authors: Pritam Dash, Ethan Chan, Nathan P. Lawrence, Karthik Pattabiraman
Subjects: cs.LG, cs.CR, cs.RO
Abstract URL: https://arxiv.org/abs/2506.22423
Pdf URL: https://arxiv.org/pdf/2506.22423
Copy Paste: [[2506.22423]] ARMOR: Robust Reinforcement Learning-based Control for UAVs under Physical Attacks(https://arxiv.org/abs/2506.22423)
Keywords: attack, robust
Abstract: Unmanned Aerial Vehicles (UAVs) depend on onboard sensors for perception, navigation, and control. However, these sensors are susceptible to physical attacks, such as GPS spoofing, that can corrupt state estimates and lead to unsafe behavior. While reinforcement learning (RL) offers adaptive control capabilities, existing safe RL methods are ineffective against such attacks. We present ARMOR (Adaptive Robust Manipulation-Optimized State Representations), an attack-resilient, model-free RL controller that enables robust UAV operation under adversarial sensor manipulation. Instead of relying on raw sensor observations, ARMOR learns a robust latent representation of the UAV's physical state via a two-stage training framework. In the first stage, a teacher encoder, trained with privileged attack information, generates attack-aware latent states for RL policy training. In the second stage, a student encoder is trained via supervised learning to approximate the teacher's latent states using only historical sensor data, enabling real-world deployment without privileged information. Our experiments show that ARMOR outperforms conventional methods, ensuring UAV safety. Additionally, ARMOR improves generalization to unseen attacks and reduces training cost by eliminating the need for iterative adversarial training.

Title: CLoVE: Personalized Federated Learning through Clustering of Loss Vector Embeddings

Authors: Randeep Bhatia, Nikos Papadis, Murali Kodialam, TV Lakshman, Sayak Chakrabarty
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22427
Pdf URL: https://arxiv.org/pdf/2506.22427
Copy Paste: [[2506.22427]] CLoVE: Personalized Federated Learning through Clustering of Loss Vector Embeddings(https://arxiv.org/abs/2506.22427)
Keywords: robust, federate
Abstract: We propose CLoVE (Clustering of Loss Vector Embeddings), a novel algorithm for Clustered Federated Learning (CFL). In CFL, clients are naturally grouped into clusters based on their data distribution. However, identifying these clusters is challenging, as client assignments are unknown. CLoVE utilizes client embeddings derived from model losses on client data, and leverages the insight that clients in the same cluster share similar loss values, while those in different clusters exhibit distinct loss patterns. Based on these embeddings, CLoVE is able to iteratively identify and separate clients from different clusters and optimize cluster-specific models through federated aggregation. Key advantages of CLoVE over existing CFL algorithms are (1) its simplicity, (2) its applicability to both supervised and unsupervised settings, and (3) the fact that it eliminates the need for near-optimal model initialization, which makes it more robust and better suited for real-world applications. We establish theoretical convergence bounds, showing that CLoVE can recover clusters accurately with high probability in a single round and converges exponentially fast to optimal models in a linear setting. Our comprehensive experiments comparing with a variety of both CFL and generic Personalized Federated Learning (PFL) algorithms on different types of datasets and an extensive array of non-IID settings demonstrate that CLoVE achieves highly accurate cluster recovery in just a few rounds of training, along with state-of-the-art model accuracy, across a variety of both supervised and unsupervised PFL tasks.

Title: Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy

Authors: Yuhao Liu, Tengfei Wang, Fang Liu, Zhenwei Wang, Rynson W.H. Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22432
Pdf URL: https://arxiv.org/pdf/2506.22432
Copy Paste: [[2506.22432]] Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy(https://arxiv.org/abs/2506.22432)
Keywords: diffusion, generative
Abstract: Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project page: this https URL