2025-06-11

Title: Conservative Bias in Large Language Models: Measuring Relation Predictions

Authors: Toyin Aguda, Erik Wilson, Allan Anzagira, Simerjot Kaur, Charese Smiley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08120
Pdf URL: https://arxiv.org/pdf/2506.08120
Copy Paste: [[2506.08120]] Conservative Bias in Large Language Models: Measuring Relation Predictions(https://arxiv.org/abs/2506.08120)
Keywords: language model, llm, hallucination, prompt
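Abstract: Large language models (LLMs) exhibit pronounced conservative bias in relation extraction tasks, frequently defaulting to the No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson's choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.
Summary: Large language models (LLMs) show a marked conservative bias in relation extraction, often falling back to the No_Relation label when no suitable option is offered. This behavior guards against incorrect relation assignments, but our analysis shows that it also causes substantial information loss when the reasoning is not explicitly part of the output. We evaluate this trade-off systematically across prompts, datasets, and relation types, and introduce the notion of Hobson's choice for cases where a model picks a safe but uninformative label rather than a hallucinated one. Our findings indicate that conservative bias occurs twice as often as hallucination. To quantify the effect, we use SBERT and LLM prompts to measure the semantic similarity between conservative-bias behavior under constrained prompts and the labels produced by semi-constrained and open-ended prompts.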

Title: QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Authors: Jacob Dineen (1), Aswin RRV (1), Qin Liu (2), Zhikun Xu (1), Xiao Ye (1), Ming Shen (1), Zhaonan Li (1), Shijie Lu (1), Chitta Baral (1), Muhao Chen (2), Ben Zhou (1) ((1) Arizona State University, (2) University of California Davis)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08123
Pdf URL: https://arxiv.org/pdf/2506.08123
Copy Paste: [[2506.08123]] QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA(https://arxiv.org/abs/2506.08123)
Keywords: language model, llm
Abstract: Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.
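Summary: Aligning large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is essential for safe and reliable AI systems. Standard reward-based alignment methods, however, usually collapse diverse feedback into a single scalar reward, entangling several objectives in one opaque training signal and hindering interpretability. We introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle inside the reward mechanism. Rather than training a black-box reward model that outputs one monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives a separate reward component for each principle, which makes it a drop-in replacement for a reward model. Experiments aligning an uncensored large language model with a set of constitutional principles show that QA-LIGN brings greater transparency and adaptability to the alignment process while performing on par with or better than a DPO baseline. Overall, the results are a step toward more interpretable and controllable alignment without sacrificing end-task performance.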

Title: EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Authors: Zefang Liu, Yinzhu Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08136
Pdf URL: https://arxiv.org/pdf/2506.08136
Copy Paste: [[2506.08136]] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments(https://arxiv.org/abs/2506.08136)
Keywords: language model, llm, prompt, agent
Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.
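Summary: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. It comprises 360 curated tasks drawn from 82 authoritative websites covering macroeconomics, labor, finance, trade, and public policy. Each task requires an agent to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We build the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks and then applying rigorous human curation for clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded, web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and run ablation studies on visual grounding, plan-based reasoning, and interaction design. The results reveal large performance gaps and persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.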

Title: Multilingual Hate Speech Detection in Social Media Using Translation-Based Approaches with Large Language Models

Authors: Muhammad Usman, Muhammad Ahmad, M. Shahiki Tash, Irina Gelbukh, Rolando Quintero Tellez, Grigori Sidorov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08147
Pdf URL: https://arxiv.org/pdf/2506.08147
Copy Paste: [[2506.08147]] Multilingual Hate Speech Detection in Social Media Using Translation-Based Approaches with Large Language Models(https://arxiv.org/abs/2506.08147)
Keywords: language model, gpt, llm
Abstract: Social media platforms are critical spaces for public discourse, shaping opinions and community dynamics, yet their widespread use has amplified harmful content, particularly hate speech, threatening online safety and inclusivity. While hate speech detection has been extensively studied in languages like English and Spanish, Urdu remains underexplored, especially using translation-based approaches. To address this gap, we introduce a trilingual dataset of 10,193 tweets in English (3,834 samples), Urdu (3,197 samples), and Spanish (3,162 samples), collected via keyword filtering, with a balanced distribution of 4,849 Hateful and 5,344 Not-Hateful labels. Our methodology leverages attention layers as a precursor to transformer-based models and large language models (LLMs), enhancing feature extraction for multilingual hate speech detection. For non-transformer models, we use TF-IDF for feature extraction. The dataset is benchmarked using state-of-the-art models, including GPT-3.5 Turbo and Qwen 2.5 72B, alongside traditional machine learning models like SVM and other transformers (e.g., BERT, RoBERTa). Three annotators, following rigorous guidelines, ensured high dataset quality, achieving a Fleiss' Kappa of 0.821. Our approach, integrating attention layers with GPT-3.5 Turbo and Qwen 2.5 72B, achieves strong performance, with macro F1 scores of 0.87 for English (GPT-3.5 Turbo), 0.85 for Spanish (GPT-3.5 Turbo), 0.81 for Urdu (Qwen 2.5 72B), and 0.88 for the joint multilingual model (Qwen 2.5 72B). These results reflect improvements of 8.75% in English (over SVM baseline 0.80), 8.97% in Spanish (over SVM baseline 0.78), 5.19% in Urdu (over SVM baseline 0.77), and 7.32% in the joint multilingual model (over SVM baseline 0.82). Our framework offers a robust solution for multilingual hate speech detection, fostering safer digital communities worldwide.
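Summary: Social media platforms are key spaces for public discourse that shape opinions and community dynamics, yet their widespread use has amplified harmful content, especially hate speech, threatening online safety and inclusivity. Hate speech detection has been studied extensively for languages such as English and Spanish, but Urdu remains underexplored, particularly with translation-based approaches. To address this gap, we introduce a trilingual dataset of 10,193 tweets in English (3,834 samples), Urdu (3,197 samples), and Spanish (3,162 samples), collected via keyword filtering, with a balanced distribution of 4,849 Hateful and 5,344 Not-Hateful labels. Our methodology uses attention layers as a precursor to transformer-based models and large language models (LLMs) to enhance feature extraction for multilingual hate speech detection. For non-transformer models, we use TF-IDF for feature extraction. The dataset is benchmarked with state-of-the-art models, including GPT-3.5 Turbo and Qwen 2.5 72B, alongside traditional machine learning models such as SVM and other transformers (e.g., BERT, RoBERTa). Three annotators following rigorous guidelines ensured high dataset quality, reaching a Fleiss' Kappa of 0.821. Integrating attention layers with GPT-3.5 Turbo and Qwen 2.5 72B yields macro F1 scores of 0.87 for English (GPT-3.5 Turbo), 0.85 for Spanish (GPT-3.5 Turbo), 0.81 for Urdu (Qwen 2.5 72B), and 0.88 for the joint multilingual model (Qwen 2.5 72B). These correspond to improvements of 8.75% in English (over an SVM baseline of 0.80), 8.97% in Spanish (over 0.78), 5.19% in Urdu (over 0.77), and 7.32% for the joint multilingual model (over 0.82). Our framework offers a robust solution for multilingual hate speech detection, fostering safer digital communities worldwide.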
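The dataset sizes and percentage gains quoted in this entry can be re-derived from the reported figures. The snippet below is a minimal arithmetic check, assuming the percentages are relative gains of the macro F1 score over the SVM baseline; the function and variable names are illustrative and not taken from the paper.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Percentage gain of `score` over `baseline`."""
    return (score - baseline) / baseline * 100.0


# (macro F1, SVM baseline) pairs as quoted in the abstract.
RESULTS = {
    "English": (0.87, 0.80),
    "Spanish": (0.85, 0.78),
    "Urdu": (0.81, 0.77),
    "Joint": (0.88, 0.82),
}

# Relative gains rounded to two decimals, matching the abstract's precision.
IMPROVEMENTS = {
    name: round(relative_improvement(score, baseline), 2)
    for name, (score, baseline) in RESULTS.items()
}

# Sample counts as quoted in the abstract.
LANGUAGE_COUNTS = {"English": 3834, "Urdu": 3197, "Spanish": 3162}
LABEL_COUNTS = {"Hateful": 4849, "Not-Hateful": 5344}

if __name__ == "__main__":
    for name, gain in IMPROVEMENTS.items():
        print(f"{name}: {gain:.2f}%")
    print("tweets by language:", sum(LANGUAGE_COUNTS.values()))
    print("tweets by label:", sum(LABEL_COUNTS.values()))
```

Both count breakdowns sum to the stated 10,193 tweets, and the four relative gains agree with the quoted percentages under this reading.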

Title: Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction

Authors: Gerardo Aleman Manzanarez, Nora de la Cruz Arana, Jorge Garcia Flores, Yobany Garcia Medina, Raul Monroy, Nathalie Pernelle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08172
Pdf URL: https://arxiv.org/pdf/2506.08172
Copy Paste: [[2506.08172]] Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction(https://arxiv.org/abs/2506.08172)
Keywords: language model
Abstract: Automated story writing has been a subject of study for over 60 years. Large language models can generate narratively consistent and linguistically coherent short fiction texts. Despite these advancements, rigorous assessment of such outputs for literary merit - especially concerning aesthetic qualities - has received scant attention. In this paper, we address the challenge of evaluating AI-generated microfictions and argue that this task requires consideration of literary criteria across various aspects of the text, such as thematic coherence, textual clarity, interpretive depth, and aesthetic quality. To facilitate this, we present GrAImes: an evaluation protocol grounded in literary theory, specifically drawing from a literary perspective, to offer an objective framework for assessing AI-generated microfiction. Furthermore, we report the results of our validation of the evaluation protocol, as answered by both literature experts and literary enthusiasts. This protocol will serve as a foundation for evaluating automatically generated microfictions and assessing their literary value.
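Summary: Automated story writing has been studied for more than 60 years. Large language models can now generate short fiction that is narratively consistent and linguistically coherent. Despite this progress, rigorous assessment of the literary merit of such outputs, especially their aesthetic qualities, has received little attention. In this paper we take up the challenge of evaluating AI-generated microfiction and argue that the task requires literary criteria covering several aspects of the text, such as thematic coherence, textual clarity, interpretive depth, and aesthetic quality. To that end we present GrAImes, an evaluation protocol grounded in literary theory that offers an objective framework for assessing AI-generated microfiction. We also report the results of validating the protocol, as answered by both literature experts and literary enthusiasts. The protocol is intended as a foundation for evaluating automatically generated microfictions and assessing their literary value.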

Title: LLM-BT: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding

Authors: Li Weigang, Pedro Carvalho Brom
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08174
Pdf URL: https://arxiv.org/pdf/2506.08174
Copy Paste: [[2506.08174]] LLM-BT: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding(https://arxiv.org/abs/2506.08174)
Keywords: language model, gpt, llm
Abstract: The rapid growth of English technical terms challenges traditional expert-driven standardization, especially in fast-evolving fields like AI and quantum computing. Manual methods struggle to ensure multilingual consistency. We propose LLM-BT, a back-translation framework powered by large language models (LLMs) to automate terminology verification and standardization via cross-lingual semantic alignment. Our contributions are: (1) Term-Level Consistency Validation: Using English → intermediate language → English back-translation, LLM-BT achieves high term consistency across models (e.g., GPT-4, DeepSeek, Grok), with case studies showing over 90% exact or semantic matches. (2) Multi-Path Verification Workflow: A novel "Retrieve-Generate-Verify-Optimize" pipeline integrates serial (e.g., EN → ZHcn → ZHtw → EN) and parallel (e.g., EN → Chinese/Portuguese → EN) BT routes. BLEU and term accuracy indicate strong cross-lingual robustness (BLEU > 0.45; Portuguese accuracy 100%). (3) Back-Translation as Semantic Embedding: BT is conceptualized as dynamic semantic embedding, revealing latent meaning trajectories. Unlike static embeddings, LLM-BT provides transparent path-based embeddings shaped by model evolution. LLM-BT transforms back-translation into an active engine for multilingual terminology standardization, enabling human-AI collaboration: machines ensure semantic fidelity, humans guide cultural interpretation. This infrastructure supports terminology governance across scientific and technological fields worldwide.
Summary: The rapid growth of English technical terms challenges traditional expert-driven standardization, particularly in fast-moving fields such as AI and quantum computing, and manual methods struggle to keep terminology consistent across languages. We propose LLM-BT, a back-translation framework powered by large language models (LLMs) that automates terminology verification and standardization through cross-lingual semantic alignment. Our contributions are: (1) Term-level consistency validation: with English → intermediate language → English back-translation, LLM-BT achieves high term consistency across models (e.g., GPT-4, DeepSeek, Grok), and case studies show over 90% exact or semantic matches. (2) Multi-path verification workflow: a "Retrieve-Generate-Verify-Optimize" pipeline combines serial (e.g., EN → ZHcn → ZHtw → EN) and parallel (e.g., EN → Chinese/Portuguese → EN) back-translation routes; BLEU and term accuracy indicate strong cross-lingual robustness (BLEU > 0.45; Portuguese accuracy 100%). (3) Back-translation as semantic embedding: back-translation is framed as a dynamic semantic embedding that reveals latent meaning trajectories; unlike static embeddings, LLM-BT yields transparent, path-based embeddings shaped by model evolution. LLM-BT turns back-translation into an active engine for multilingual terminology standardization and enables human-AI collaboration in which machines ensure semantic fidelity and humans guide cultural interpretation. This infrastructure supports terminology governance across scientific and technological fields worldwide.
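The English → intermediate language → English round trip described in this entry can be illustrated with a small consistency check. This is only a sketch: the two glossaries are toy stand-ins for LLM translation calls, all names are hypothetical, and it scores exact matches only, whereas the abstract also counts semantic matches.

```python
from typing import Callable, Iterable

# Toy glossaries standing in for LLM translation calls (hypothetical data).
EN_TO_PT = {
    "neural network": "rede neural",
    "quantum computing": "computação quântica",
    "deep learning": "aprendizagem profunda",
}
PT_TO_EN = {
    "rede neural": "neural network",
    "computação quântica": "quantum computing",
    "aprendizagem profunda": "profound learning",  # deliberate drift on the way back
}


def back_translate(
    term: str,
    forward: Callable[[str], str],
    backward: Callable[[str], str],
) -> str:
    """Translate a term to the intermediate language and back again."""
    return backward(forward(term))


def exact_match_rate(
    terms: Iterable[str],
    forward: Callable[[str], str],
    backward: Callable[[str], str],
) -> float:
    """Fraction of terms that survive the round trip unchanged (case-insensitive)."""
    term_list = list(terms)
    hits = sum(
        back_translate(term, forward, backward).casefold() == term.casefold()
        for term in term_list
    )
    return hits / len(term_list)


# Two of the three toy terms survive the round trip.
RATE = exact_match_rate(EN_TO_PT, EN_TO_PT.__getitem__, PT_TO_EN.__getitem__)

if __name__ == "__main__":
    print(f"exact-match consistency: {RATE:.2f}")
```

In a real setting the `forward` and `backward` callables would wrap model calls, and terms that fail the exact check would go on to a semantic comparison or human review.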

Title: Unable to forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Authors: Chupei Wang (University of Virginia), Jiaqiu Vince Sun (New York University)
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.08184
Pdf URL: https://arxiv.org/pdf/2506.08184
Copy Paste: [[2506.08184]] Unable to forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length(https://arxiv.org/abs/2506.08184)
Keywords: language model, llm, prompt
Abstract: Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
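Summary: Information retrieval in large language models (LLMs) is increasingly seen as intertwined with generation rather than as simple lookup. Longer contexts are often assumed to improve retrieval, but the effects of interference within the context remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, in which earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely related to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although the final values sit clearly just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates, and the errors come from retrieving previously overwritten values. Attempts to mitigate interference through prompt engineering (e.g., instructing models to ignore earlier input) have limited success. The findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, pointing to a working memory bottleneck beyond mere context access, and call for approaches that strengthen models' ability to suppress irrelevant content during retrieval.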
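The evaluation idea in this entry (stream many updates per key, then ask only for the final values) can be sketched as follows. This is a simplified illustration, not the authors' PI-LLM harness: the prompt wording, the random value choice, the shuffled interleaving, and all function names are assumptions.

```python
import random


def build_update_stream(keys, values, updates_per_key, seed=0):
    """Return a shuffled list of (key, value) updates and the final value per key."""
    rng = random.Random(seed)
    stream = [(key, rng.choice(values)) for key in keys for _ in range(updates_per_key)]
    rng.shuffle(stream)
    final = {}
    for key, value in stream:
        final[key] = value  # a later update overwrites any earlier one
    return stream, final


def render_prompt(stream, keys):
    """Render the updates followed by a single query about the current values."""
    lines = [f"{key}: {value}" for key, value in stream]
    question = "What is the current value of each key? " + ", ".join(keys)
    return "\n".join(lines + [question])


def classify_answer(key, answer, stream, final):
    """Label an answer as correct, an overwritten earlier value, or something else."""
    if answer == final[key]:
        return "correct"
    seen = {value for k, value in stream if k == key}
    return "overwritten" if answer in seen else "other"


if __name__ == "__main__":
    keys = ["fruit", "color"]
    stream, final = build_update_stream(keys, ["red", "blue", "kiwi", "plum"], 4)
    print(render_prompt(stream, keys))
    print("ground truth:", final)
```

Raising `updates_per_key` increases the amount of interference, and the share of answers labeled "overwritten" is the error type the abstract attributes the accuracy decline to.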

Title: "I Wrote, I Paused, I Rewrote" Teaching LLMs to Read Between the Lines of Student Writing

Authors: Samra Zafar, Shaheer Minhas, Syed Ali Hassan Zaidi, Arfa Naeem, Zahra Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08221
Pdf URL: https://arxiv.org/pdf/2506.08221
Copy Paste: [[2506.08221]] "I Wrote, I Paused, I Rewrote" Teaching LLMs to Read Between the Lines of Student Writing(https://arxiv.org/abs/2506.08221)
Keywords: language model, llm
Abstract: Large language models (LLMs) like Gemini are becoming common tools for supporting student writing. But most of their feedback is based only on the final essay, missing important context about how that text was written. In this paper, we explore whether using writing process data, collected through keystroke logging and periodic snapshots, can help LLMs give feedback that better reflects how learners think and revise while writing. We built a digital writing tool that captures both what students type and how their essays evolve over time. Twenty students used this tool to write timed essays, which were then evaluated in two ways: (i) an LLM generated feedback using both the final essay and the full writing trace, and (ii) after the task, students completed surveys about how useful and relatable they found the feedback. Early results show that learners preferred the process-aware LLM feedback, finding it more in tune with their own thinking. We also found that certain types of edits, like adding new content or reorganizing paragraphs, aligned closely with higher scores in areas like coherence and elaboration. Our findings suggest that making LLMs more aware of the writing process can lead to feedback that feels more meaningful, personal, and supportive.
Summary: Large language models (LLMs) such as Gemini are becoming common tools for supporting student writing, but most of their feedback rests only on the final essay and lacks context about how the text was written. We examine whether writing process data, collected through keystroke logging and periodic snapshots, helps LLMs give feedback that better reflects how learners think and revise while writing. We built a digital writing tool that records both what students type and how their essays evolve over time. Twenty students used the tool to write timed essays, which were evaluated in two ways: (i) an LLM generated feedback from both the final essay and the full writing trace, and (ii) after the task, students completed surveys on how useful and relatable they found the feedback. Early results show that learners preferred the process-aware LLM feedback and found it more in tune with their own thinking. Certain types of edits, such as adding new content or reorganizing paragraphs, also aligned closely with higher scores in areas like coherence and elaboration. The findings suggest that making LLMs more aware of the writing process can produce feedback that feels more meaningful, personal, and supportive.

Title: Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions

Authors: Yu-Ang Lee, Guan-Ting Yi, Mei-Yi Liu, Jui-Chao Lu, Guan-Bo Yang, Yun-Nung Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08234
Pdf URL: https://arxiv.org/pdf/2506.08234
Copy Paste: [[2506.08234]] Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions(https://arxiv.org/abs/2506.08234)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at this https URL.
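Summary: Recent advances in large language models (LLMs) and AI systems have brought a paradigm shift in how complex AI workflows are designed and optimized. By integrating multiple components, compound AI systems have become increasingly capable of sophisticated tasks. As these systems grow more complex, however, new challenges arise in optimizing not only the individual components but also their interactions. Traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, while the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper systematically reviews recent progress in optimizing compound AI systems, covering both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at this https URL.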

Title: Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning

Authors: Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar, Zining Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08235
Pdf URL: https://arxiv.org/pdf/2506.08235
Copy Paste: [[2506.08235]] Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning(https://arxiv.org/abs/2506.08235)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs' ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.
Summary: Large language models (LLMs) are increasingly used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand the intricate relationships inside research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. We present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs on scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses. An evaluation covering more than 300 claim-evidence pairs from multiple research domains reveals significant limitations in LLMs' ability to process complex scientific content. Closed-source models such as GPT-4 and Claude consistently outperform open-source counterparts in precision and recall on claim-evidence identification. Strategically designed three-pass and one-by-one prompting significantly improve the ability to link dispersed evidence with claims, though at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path toward systems capable of deeper, more reliable reasoning over full-length papers.

Title: Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

Authors: Wanjing Anya Ma, Michael Flor, Zuowei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08260
Pdf URL: https://arxiv.org/pdf/2506.08260
Copy Paste: [[2506.08260]] Automatic Generation of Inference Making Questions for Reading Comprehension Assessments(https://arxiv.org/abs/2506.08260)
Keywords: gpt, llm, prompt, chain-of-thought
Abstract: Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.
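Summary: Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, while others rely on prior knowledge to fill in details not explicitly stated in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items in a diagnostic RC item bank. We then present experiments that use GPT-4o with few-shot prompting to generate bridging-inference RC items for given reading passages, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects (overall item quality, appropriate inference type, and LLM reasoning), with high inter-rater agreement above 0.90. GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts, but only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.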

Title: Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Authors: Matteo Cargnelutti, Catherine Brobston, John Hess, Jack Cushman, Kristi Mukk, Aristana Scourtas, Kyle Courtney, Greg Leppert, Amanda Watson, Martha Whitehead, Jonathan Zittrain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08300
Pdf URL: https://arxiv.org/pdf/2506.08300
Copy Paste: [[2506.08300]] Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability(https://arxiv.org/abs/2506.08300)
Keywords: language model, llm
Abstract: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.
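Summary: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. The nature, scale, quality, and diversity of the datasets used to train these models, or to support them at inference time, therefore directly affect their quality. The rapid development and adoption of LLMs of varying quality has highlighted the scarcity of publicly available, high-quality training data and the urgent need to ground the stewardship of such datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. The analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages, for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) and the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. The report describes the project's goals and methods and the results of our analyses, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read, and use.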

Title: Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency

Authors: Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, Tianyi Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08343
Pdf URL: https://arxiv.org/pdf/2506.08343
Copy Paste: [[2506.08343]] Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency(https://arxiv.org/abs/2506.08343)
Keywords: chain-of-thought
Abstract: Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as "Wait" and "Hmm", is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
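Summary: Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, producing verbose and redundant outputs that hurt efficiency. We examine whether explicit self-reflection, signaled by tokens such as "Wait" and "Hmm", is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks spanning textual, visual, and video reasoning tasks show that NoWait shortens chain-of-thought trajectories by up to 27%-51% in five R1-style model series without compromising model utility. NoWait thus offers a plug-and-play solution for efficient, utility-preserving multimodal reasoning.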
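One common way to suppress specific tokens at inference time is to mask their logits before the next token is chosen. The toy sketch below illustrates that general idea on a five-word vocabulary; it is an assumption about how such suppression can be done, not NoWait's actual implementation, and the vocabulary and logit values are made up for illustration.

```python
import math


def suppress_tokens(logits, banned_ids):
    """Return a copy of `logits` with banned positions set to negative infinity."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


def greedy_pick(logits):
    """Index of the highest-scoring token (greedy decoding for one step)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical vocabulary and next-token logits for a single decoding step.
VOCAB = ["Wait", "Hmm", "So", "the", "answer"]
BANNED = {VOCAB.index(token) for token in ("Wait", "Hmm")}
STEP_LOGITS = [3.2, 2.9, 2.5, 0.1, 1.0]

UNCONSTRAINED = VOCAB[greedy_pick(STEP_LOGITS)]
SUPPRESSED = VOCAB[greedy_pick(suppress_tokens(STEP_LOGITS, BANNED))]

if __name__ == "__main__":
    print("without suppression:", UNCONSTRAINED)
    print("with suppression:", SUPPRESSED)
```

With the reflection tokens masked, decoding falls through to the best remaining candidate, so the model continues without opening a new self-reflection segment at that step.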

Title: Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

Authors: Yuxuan Zhou, Xien Liu, Chenwei Yan, Chen Ning, Xiao Zhang, Boxun Li, Xiangling Fu, Shijin Wang, Guoping Hu, Yu Wang, Ji Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08349
Pdf URL: https://arxiv.org/pdf/2506.08349
Copy Paste: [[2506.08349]] Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving(https://arxiv.org/abs/2506.08349)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs' medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.
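Summary: Large language models (LLMs) have shown remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using it, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. The findings reveal a significant performance decline across the evaluated models as cognitive complexity increases, with model size playing a more critical role at higher cognitive levels. The study highlights the need to strengthen LLMs' medical capabilities at higher cognitive levels and offers insights for developing LLMs suited to real-world medical applications.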

Title: DEAL: Disentangling Transformer Head Activations for LLM Steering

Authors: Li-Ming Zhan, Bo Liu, Zexin Lu, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08359
Pdf URL: https://arxiv.org/pdf/2506.08359
Copy Paste: [[2506.08359]] DEAL: Disentangling Transformer Head Activations for LLM Steering(https://arxiv.org/abs/2506.08359)
Keywords: language model, llm
Abstract: Inference-time steering aims to alter the response characteristics of large language models (LLMs) without modifying their underlying parameters. A critical step in this process is the identification of internal modules within LLMs that are associated with the target behavior. However, current approaches to module selection often depend on superficial cues or ad-hoc heuristics, which can result in suboptimal or unintended outcomes. In this work, we propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces, each quantized with a shared learnable codebook. We assess the behavioral relevance of each head by quantifying the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses using a binary classification metric. This yields a behavioral relevance score that reflects each head's discriminative capacity with respect to the target behavior, guiding both selection and importance weighting. Experiments on seven LLMs from two model families and five behavioral steering datasets demonstrate that our method enables more accurate inference-time interventions, achieving superior performance on the truthfulness-steering task. Furthermore, the heads selected by our approach exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
摘要：推理时间转向旨在改变大语模型（LLM）的响应特征，而无需修改其基本参数。此过程中的关键步骤是识别与目标行为相关的LLM中的内部模块。但是，当前的模块选择方法通常取决于表面提示或临时启发式方法，这可能导致次优或意外的结果。在这项工作中，我们提出了一个原则性的因果关系框架，用于识别变压器中与行为相关的注意力。对于每个头部，我们都会在其注意力激活上训练一个矢量定量的自动编码器（VQ-AE），将潜在空间划分为与行为相关的和行为 - iRretrelevant子空间，每一个都用共享的可学习代码量进行量化。我们通过量化使用二进制分类度量的VQ-AE编码对行为一致性与行为侵入反应的可分离性来评估每个头部的行为相关性。这产生了一个行为相关性得分，反映了目标行为相对于目标行为的每个头部判别能力，从而指导选择和重要性加权。对来自两个模型家族和五个行为转向数据集的七个LLM的实验表明，我们的方法可以更准确地推理时间干预，从而在真实性的任务上实现了卓越的表现。此外，我们的方法选择的头部在跨域真实性的场景中表现出强烈的零拍概括。
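Illustrative sketch (not from the paper): the abstract says each head is scored by how well its encodings separate behavior-aligned from behavior-violating responses "using a binary classification metric", without naming the metric. The Python below assumes, purely for illustration, that each encoding has already been reduced to one scalar per response and that the metric is ROC AUC. The head names and values are made up, and the VQ-AE itself is not modeled.

```python
def auc(aligned, violating):
    """ROC AUC: share of (aligned, violating) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in aligned:
        for v in violating:
            if a > v:
                wins += 1.0
            elif a == v:
                wins += 0.5
    return wins / (len(aligned) * len(violating))


def rank_heads(head_scores):
    """Rank heads by separability, rescaled so 0 means chance and 1 means perfect.

    head_scores maps a head id to (aligned_values, violating_values).
    """
    relevance = {
        head: abs(auc(aligned, violating) - 0.5) * 2.0
        for head, (aligned, violating) in head_scores.items()
    }
    return sorted(relevance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-response scalars for two heads.
heads = {
    "L3H1": ([0.9, 0.8, 0.7], [0.2, 0.3, 0.1]),  # cleanly separated
    "L5H2": ([0.5, 0.4, 0.6], [0.5, 0.6, 0.4]),  # indistinguishable
}
ranking = rank_heads(heads)
```

Under these assumptions the ranking could serve both purposes the abstract mentions: the top entries are the selected heads, and the scores are their importance weights.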

Title: CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs

Authors: Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08364
Pdf URL: https://arxiv.org/pdf/2506.08364
Copy Paste: [[2506.08364]] CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs(https://arxiv.org/abs/2506.08364)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Understanding cause and effect relationships remains a formidable challenge for Large Language Models (LLMs), particularly in specialized domains where reasoning requires more than surface-level correlations. Retrieval-Augmented Generation (RAG) improves factual accuracy, but standard RAG pipelines treat evidence as flat context, lacking the structure required to model true causal dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that integrates zero-shot triple extraction and theme-aware graph chaining into the RAG pipeline, enabling structured multi-hop inference. Given a domain-specific corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of <cause, relation, effect> triples and uses forward/backward chaining to guide structured answer generation. Experiments on two real-world domains, Bitcoin price fluctuations and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity. Both LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results demonstrate that explicitly modeling causal structure enables LLMs to generate more accurate and interpretable responses, especially in specialized domains where flat retrieval fails.
摘要：理解因果关系仍然是大型语言模型（LLM）的巨大挑战，尤其是在推理需要比表面级相关性更多的专业领域。检索增强的生成（RAG）提高了事实的准确性，但是标准的RAG管道将证据视为平坦的环境，缺乏建模真正因果关系依赖性所需的结构。我们引入了因果链抹布（CC-rag），这是一种新颖的方法，将零拍的三重提取和主题感知的图形链链整合到抹布管道中，从而实现了结构化的多跳推断。给定域特定的语料库，CC rag构建了<原因，关系，效果>三元>的定向无环形图（DAG），并使用前向/向后链链来指导结构化答案生成。在两个现实世界中的实验：比特币的价格波动和Gaucher病，表明CC rag在链相似性，信息密度和词汇多样性中优于标准抹布和零拍的LLM。 LLM-AS-A-A-a-a-a-a-As-A-As-A-As-Ass Iss进行的评估始终有利于CC-rag。我们的结果表明，显式建模因果结构使LLM可以生成更准确和可解释的响应，尤其是在平坦检索失败的专用域中。
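Illustrative sketch (not from the paper): the abstract describes a DAG built from cause-relation-effect triples and traversed by forward/backward chaining. The Python below shows one minimal way such chaining could work. The triples are invented examples, and the function names (`build_graph`, `chains`) are hypothetical rather than the authors' API. Triple extraction and theme-aware filtering are omitted.

```python
from collections import defaultdict

# Hypothetical triples; in the paper these would come from zero-shot LLM extraction.
TRIPLES = [
    ("interest rate hike", "reduces", "risk appetite"),
    ("risk appetite", "drives", "bitcoin demand"),
    ("bitcoin demand", "raises", "bitcoin price"),
    ("exchange hack", "lowers", "bitcoin demand"),
]


def build_graph(triples):
    """Return (forward, backward) adjacency maps for the causal graph."""
    forward = defaultdict(list)   # cause  -> [(relation, effect)]
    backward = defaultdict(list)  # effect -> [(relation, cause)]
    for cause, relation, effect in triples:
        forward[cause].append((relation, effect))
        backward[effect].append((relation, cause))
    return forward, backward


def chains(graph, start, max_hops=4):
    """Enumerate node paths from `start` until a dead end or `max_hops` edges."""
    results = []

    def walk(node, path):
        edges = [(rel, nxt) for rel, nxt in graph.get(node, []) if nxt not in path]
        if not edges or len(path) - 1 == max_hops:
            if len(path) > 1:
                results.append(path)
            return
        for _relation, nxt in edges:
            walk(nxt, path + [nxt])

    walk(start, [start])
    return results


forward, backward = build_graph(TRIPLES)
effects_of_hike = chains(forward, "interest rate hike")  # forward chaining
causes_of_price = chains(backward, "bitcoin price")      # backward chaining
```

Forward chaining answers "what follows from X?", while backward chaining answers "what could have led to Y?". Either list of chains could then be serialized into the prompt as structured evidence.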

Title: Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding

Authors: Zikai Xiao, Ziyang Wang, Wen Ma, Yan Zhang, Wei Shen, Yan Wang, Luqi Gong, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08371
Pdf URL: https://arxiv.org/pdf/2506.08371
Copy Paste: [[2506.08371]] Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding(https://arxiv.org/abs/2506.08371)
Keywords: language model, llm, long context
Abstract: While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by this observation, we propose the training-free Positional Contrastive Decoding (PCD), which contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.
摘要：尽管大型语言模型（LLMS）支持长上下文，但它们在上下文窗口中与性能退化斗争。当前的解决方案会产生过度的培训成本，留下统计行为和具有成本效益的方法。从解码的角度来看，我们确定了后显着性衰减（PSA）现象，其中显着率与长文本性能降解相关。值得注意的是，尽管衰减，黄金令牌仍然在解码空间中占据高级位置。由此激励，我们提出了无训练的位置对比解码（PCD），将人们从渴望的关注引起的逻辑与设计的局部感知注意力的对象进行了对比，从而使模型能够专注于大规模短期训练所带来的成就。通过对长期衰减模拟的分析，我们证明了PCD有效地减轻了注意力评分降解。实验结果表明，PCD可以在长篇小写基准测试上实现最先进的性能。
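Illustrative sketch (not from the paper): the abstract states only that PCD contrasts logits from long-aware attention with logits from local-aware attention. The Python below uses the generic contrastive-decoding form, `(1 + alpha) * long - alpha * local`, as an assumption. The paper's actual combination rule, the `alpha` parameter, and the toy logits are not taken from the source.

```python
def contrastive_logits(long_logits, local_logits, alpha=1.0):
    """Amplify what long-range attention adds beyond local attention (assumed form)."""
    return [(1.0 + alpha) * lng - alpha * loc
            for lng, loc in zip(long_logits, local_logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens. Token 1 plays the role of the gold token:
# it ranks high but not first under long-aware attention, and its score
# depends much more on distant context than token 0's does.
long_logits = [2.0, 1.9, 0.0]
local_logits = [2.0, 1.0, 0.0]

adjusted = contrastive_logits(long_logits, local_logits, alpha=1.0)
before = argmax(long_logits)  # token 0 wins without the contrast
after = argmax(adjusted)      # token 1 wins once the local signal is subtracted
```

The toy case mirrors the abstract's observation that gold tokens remain highly ranked despite attenuation: a small contrastive push is enough to promote them to the top.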

Title: Draft-based Approximate Inference for LLMs

Authors: Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08373
Pdf URL: https://arxiv.org/pdf/2506.08373
Copy Paste: [[2506.08373]] Draft-based Approximate Inference for LLMs(https://arxiv.org/abs/2506.08373)
Keywords: language model, llm, prompt
Abstract: Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at this https URL.
摘要：由于变形金刚的二次计算和线性记忆复杂性，对长篇小说大语言模型（LLM）的优化推断越来越重要。现有的近似方法，例如键值（KV）缓存下降，稀疏注意和及时压缩，通常依赖于令牌或KV对的粗略预测。我们为近似LLM推理提出了一个新的框架，该框架利用小型草稿模型更准确地预测令牌和KV对的重要性。具体而言，我们介绍了我们提出的框架的两个实例：（i）SpecKV，它利用草案的输出来准确评估每个KV对对于更有效的KV缓存下降的重要性，以及（ii）SpecPC，该SpecPC使用草案的注意力激活来识别和丢弃不重要的提示令牌。据我们所知，这是第一项使用草案模型进行近似LLM推理加速度的工作，将其效用扩展到传统的无损投机解码之外。我们通过理论和经验分析来激励我们的方法，并在草稿和目标模型的注意力模式之间显示出很强的相关性。对长篇文本基准测试的广泛实验表明，我们的方法始终达到的精度比现有基线更高，同时保留了记忆使用，延迟和吞吐量的相同改进。我们的代码可在此HTTPS URL上找到。
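Illustrative sketch (not from the paper): the abstract says SpecKV uses a draft output to score each KV pair and then drops the less important ones. The Python below assumes, for illustration only, that importance is the maximum attention weight a cached position receives from any draft-generated token, and that a fixed budget of positions is kept. The aggregation rule, the `budget` parameter, and the toy attention matrix are assumptions.

```python
def select_kv_to_keep(draft_attention, budget):
    """Pick which cached positions to keep.

    draft_attention[q][j] is the attention weight that draft token q places on
    cached position j. Returns the `budget` most important positions, in
    their original order.
    """
    num_positions = len(draft_attention[0])
    importance = [max(row[j] for row in draft_attention)
                  for j in range(num_positions)]
    ranked = sorted(range(num_positions),
                    key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:budget])


# Two draft tokens attending over four cached prompt positions (toy values).
draft_attention = [
    [0.5, 0.1, 0.3, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
kept = select_kv_to_keep(draft_attention, budget=2)
```

Taking the maximum rather than the mean keeps a position that matters strongly to even one future token. Whether the paper aggregates this way is not stated in the abstract.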

Title: EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models

Authors: Tao Zou, Xinghua Zhang, Haiyang Yu, Minzheng Wang, Fei Huang, Yongbin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08375
Pdf URL: https://arxiv.org/pdf/2506.08375
Copy Paste: [[2506.08375]] EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models(https://arxiv.org/abs/2506.08375)
Keywords: language model, llm
Abstract: With the development and widespread application of large language models (LLMs), the new paradigm of "Model as Product" is rapidly evolving, and demands higher capabilities to address complex user needs, often requiring precise workflow execution which involves the accurate understanding of multiple tasks. However, existing benchmarks focusing on single-task environments with limited constraints lack the complexity required to fully reflect real-world scenarios. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity for ongoing optimization to navigate the intricate challenges posed by LLM applications.
摘要：随着大语模型（LLM）的开发和广泛应用，“作为产品”的新范式正在迅速发展，并且需要更高的能力来满足复杂的用户需求，通常需要精确的工作流执行，这涉及对多个任务的准确理解。但是，专注于有限约束的单任务环境的现有基准缺乏完全反映现实世界情景所需的复杂性。为了弥合这一差距，我们介绍了基准（EIFbench）之后的极其复杂的说明，该指令精心制作，以促进对LLM的更现实，更强大的评估。 EIFBENCH不仅包括多任务场景，可以同时进行跨不同任务类型的全面评估，而且还整合了各种约束，并复制复杂的操作环境。此外，我们提出了细分策略优化（SEGPO）算法，以增强LLM准确实现多任务工作流的能力。当对这些极其复杂的说明挑战时，对EIFbench的评估已经公布了现有LLM的大量性能差异。这一发现强调了进行持续优化的必要性，以应对LLM应用程序提出的复杂挑战。

Title: mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

Authors: Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani
Subjects: cs.CL, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2506.08400
Pdf URL: https://arxiv.org/pdf/2506.08400
Copy Paste: [[2506.08400]] mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks(https://arxiv.org/abs/2506.08400)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and the Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.
摘要：大型语言模型（LLM）在各种任务上表现出令人印象深刻的表现，包括在诸如语音之类的多模式设置中。但是，他们的评估通常仅限于英语和几种高资源语言。对于低资源语言，没有标准化的评估基准。在本文中，我们通过介绍MSTEB来解决这一差距，MSTEB是一种新的基准，该基准在涵盖语言标识，文本分类，问题答案以及对语音和文本方式的各种任务上评估LLM的性能。我们评估了领先的LLM的性能，例如Gemini 2.0 Flash和GPT-4O（音频）以及最先进的开放模型，例如Qwen 2 Audio和Gemma 3 27B。我们的评估表明，高资源和低资源语言之间的性能差距很大，尤其是对于非洲和美洲/大洋洲所说的语言。我们的发现表明，需要更多的投资来解决其在LLMS覆盖范围内的人数不足。

Title: TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Authors: Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Youyan Wang, Wujiuge Yin, Hu Song, Bing Huang, Zhiyuan Xia, Jialiang Chen, Linfeng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08403
Pdf URL: https://arxiv.org/pdf/2506.08403
Copy Paste: [[2506.08403]] TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration(https://arxiv.org/abs/2506.08403)
Keywords: language model, gpt, llm, agent
Abstract: Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at this https URL.
摘要：长期以来，机器翻译一直是自然语言处理的核心任务。随着大语言模型（LLM）的快速发展，翻译质量取得了显着进步。但是，充分意识到LLM的翻译潜力仍然是一个开放的挑战。最近的研究探索了多代理系统，将复杂的翻译任务分解为协作子任务，从而通过代理合作和专业化来提高翻译质量，从而显示出最初的希望。然而，现有的多代理翻译框架在很大程度上忽略了认知翻译研究的基本见解。这些见解强调了人类翻译人员如何采用不同的认知策略，例如平衡文字和自由翻译，基于上下文的精炼表达方式以及迭代评估产出。为了解决这一限制，我们提出了一个称为Tactic的认知知情的多代理框架，该框架代表着以认知性的互动协作为基础。该框架包括六种在功能上不同的药物，它们反映了人类翻译行为中观察到的关键认知过程。这些包括用于制图，改进，评估，评分，上下文推理和外部知识收集的代理。通过模拟互动和理论的翻译工作流，战术有效地利用了LLM的全部容量来进行高质量的翻译。 Flores-200和WMT24基准的不同语言对的实验结果表明，我们的方法始终达到最先进的性能。使用DeepSeek-V3作为基本模型，战术平均超过GPT-4.1 +0.6 XCOMET和+1.18 COMETKIWI-23。与DeepSeek-R1相比，它进一步提高了+0.84 Xcomet和+2.99 Cometkiwi-23。代码可在此HTTPS URL上找到。

Title: Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens

Authors: Ziyang Ma, Qingyue Yuan, Zhenglin Wang, Deyu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08410
Pdf URL: https://arxiv.org/pdf/2506.08410
Copy Paste: [[2506.08410]] Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens(https://arxiv.org/abs/2506.08410)
Keywords: language model, llm, prompt
Abstract: Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.
摘要：先前的研究主要集中在大语言模型（LLMS）的认知错误检测能力上，通常促使他们分析推理链中的错误。但是，很少有研究研究LLM的元认知能力（例如，它们对步骤错误的自我意识），这对于它们的可靠性至关重要。尽管对LLM自我评估的研究提出了一些措施，例如困惑，这些措施可以反映答案正确性并被视为元认知的镜头，但它们缺乏步进级分析和适应性。本文研究了使用当前镜头对LLM元认知的评估以及如何改善这些镜头。具体而言，我们建议Automeco，这是一种自动化的元认知评估框架，用于对现有镜头进行基准测试。此外，提出了一种无训练的马尔可夫固有奖励调整策略MIRA，以增强当前的元认知镜头。三个数学推理数据集和三个LLM的实验结果通过将其与最佳N验证进行比较，显示了汽车的合理性。此外，可以使用MIRA更好地评估LLM的荟萃认知能力。

Title: Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models

Authors: Jiaxiang Liu, Boxuan Xing, Chenhao Yuan, Chenxiang Zhang, Di Wu, Xiusheng Huang, Haida Yu, Chuhan Lang, Pengfei Cao, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08427
Pdf URL: https://arxiv.org/pdf/2506.08427
Copy Paste: [[2506.08427]] Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models(https://arxiv.org/abs/2506.08427)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpreting outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpreting outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model's internal knowledge mechanisms from multiple perspectives. Our code is available at this https URL. We also provide a demonstration video on this https URL.
摘要：随着大型语言模型（LLM）的继续发展，越来越紧迫地提高了其内部知识机制的解释性。因此，出现了许多解释方法，旨在从各个角度揭示LLM的知识机制。但是，当前的解释方法在输入数据格式和解释输出方面有所不同。集成这些方法的工具只能通过特定输入来支持任务，从而大大限制其实际应用。为了应对这些挑战，我们提出了一种开源知识机制揭示者和解释器（知识-MRI），旨在系统地分析LLMS内的知识机制。具体而言，我们已经开发了一个可扩展的核心模块，该模块可以自动将不同的输入数据与解释方法匹配并巩固解释输出。它使用户能够根据输入自由选择适当的解释方法，从而更容易从多个角度全面诊断模型的内部知识机制。我们的代码可在此HTTPS URL上找到。我们还提供了此HTTPS URL的演示视频。

Title: CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Authors: Ziqi.Liu, Ziyang.Zhou, Mingxuan.Hu
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2506.08430
Pdf URL: https://arxiv.org/pdf/2506.08430
Copy Paste: [[2506.08430]] CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models(https://arxiv.org/abs/2506.08430)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I's state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.
摘要：大型语言模型（LLM）已成为讽刺检测领域的主流方法。但是，现有的LLM方法在讽刺检测中面临着挑战，包括：1。单调局限性，2。全面理解不足和3。缺乏可解释性。本文介绍了讽刺的协作代理框架（CAF-I），这是一个旨在克服这些问题的LLM驱动的多机构系统。 CAF-I使用专门的代理人来进行上下文，语义和修辞，从而进行多维分析并从事交互式协作优化。然后，决策代理可以通过改进评估器代理来巩固这些观点，从而提供有条件的反馈以进行优化。基准数据集上的实验建立了CAF-I的最先进的零击性能。 CAF-I在绝大多数指标上达到SOTA，平均宏观F1的平均宏F1为76.31，比先前最强的基线相比，绝对改善了4.98。通过有效模拟人类样式分析，提高了检测准确性和解释性，可以实现这一成功。

Title: Low-resource domain adaptation while minimizing energy and hardware resource consumption

Authors: Hernán Maina, Nicolás Wolovick, Luciana Benotti
Subjects: cs.CL, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08433
Pdf URL: https://arxiv.org/pdf/2506.08433
Copy Paste: [[2506.08433]] Low-resource domain adaptation while minimizing energy and hardware resource consumption(https://arxiv.org/abs/2506.08433)
Keywords: language model, llm
Abstract: Training Large Language Models (LLMs) is costly in terms of energy, hardware, and annotated data, often resulting in a positionality rooted in predominant cultures and values (Santy et al., 2023). Domain adaptation has emerged as a promising strategy to better align models with diverse cultural and value contexts (Hershcovich et al., 2022), but its computational cost remains a significant barrier, particularly for research groups lacking access to large-scale infrastructure. In this paper, we evaluate how the use of different numerical precisions and data parallelization strategies impacts both training speed (as a proxy for energy and hardware consumption) and model accuracy, with the goal of facilitating domain adaptation in low-resource environments. Our findings are relevant to any setting where energy efficiency, accessibility, or limited hardware availability are key concerns.
摘要：在能量，硬件和注释数据方面，训练大语言模型（LLMS）的成本高昂，通常导致位置植根于主要的文化和价值（Santy等，2023）。领域的适应性已成为一种有希望的策略，可以更好地与具有多种文化和价值背景的模型保持一致（Hershcovich等，2022），但其计算成本仍然是一个重要的障碍，尤其是对于无法获得大型基础架构的研究小组而言。在本文中，我们评估了不同数值精确的使用和数据并行化策略如何影响训练速度（作为能源和硬件消耗的代表）和模型的准确性，目的是促进在低资源环境中适应域的适应性。我们的发现与能源效率，可访问性或有限的硬件可用性是关键问题的任何环境有关。

Title: Olica: Efficient Structured Pruning of Large Language Models without Retraining

Authors: Jiujun He, Huazhen Lin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08436
Pdf URL: https://arxiv.org/pdf/2506.08436
Copy Paste: [[2506.08436]] Olica: Efficient Structured Pruning of Large Language Models without Retraining(https://arxiv.org/abs/2506.08436)
Keywords: language model, llm
Abstract: Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
摘要：大多数现有的大型语言模型（LLMS）的结构化修剪方法需要大量的计算和数据资源来重新建立损坏的相关性，从而使它们变得过于昂贵。为了解决这个问题，我们提出了一个称为正交分解和线性校准（OLICA）的LLM的修剪框架，这消除了对重新训练的需求。一个关键的观察结果是，多头注意（MHA）层取决于两种类型的基质产品。通过将这些矩阵产品视为统一实体并应用主成分分析（PCA），我们提取最重要的信息来压缩LLM，而无需牺牲准确性或破坏其原始结构。因此，再培训变得不必要。设计了一种快速分解方法，将PCA的复杂性降低了注意力头数量的平方。此外，为了减轻修剪前馈网络（FFN）层引起的错误积累问题，我们引入了一种线性校准方法，以使用低级别矩阵重建修剪层的残余误差。通过利用奇异值分解（SVD）在最小二乘问题的溶液上，这些矩阵是在不需要重新训练的情况下获得的。广泛的实验表明，所提出的OLICA在数据使用，GPU内存和运行时间方面具有有效的效率，同时在多个基准中提供了卓越的性能。

Title: Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning

Authors: Fengjun Pan, Anh Tuan Luu, Xiaobao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08477
Pdf URL: https://arxiv.org/pdf/2506.08477
Copy Paste: [[2506.08477]] Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning(https://arxiv.org/abs/2506.08477)
Keywords: language model, llm, prompt
Abstract: Detecting harmful memes is essential for maintaining the integrity of online environments. However, current approaches often struggle with resource efficiency, flexibility, or explainability, limiting their practical deployment in content moderation systems. To address these challenges, we introduce U-CoT+, a novel framework for harmful meme detection. Instead of relying solely on prompting or fine-tuning multimodal models, we first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content and enabling resource-efficient harmful meme detection with general large language models (LLMs). Building on these textual descriptions, we further incorporate targeted, interpretable human-crafted guidelines to guide models' reasoning under zero-shot CoT prompting. As such, this framework allows for easy adaptation to different harmfulness detection criteria across platforms, regions, and over time, offering high flexibility and explainability. Extensive experiments on seven benchmark datasets validate the effectiveness of our framework, highlighting its potential for explainable and low-resource harmful meme detection using small-scale LLMs. Codes and data are available at: this https URL.
摘要：检测有害模因对于维持在线环境的完整性至关重要。但是，当前的方法通常在资源效率，灵活性或解释性方面困难，从而限制了其在内容审核系统中的实际部署。为了应对这些挑战，我们引入了U-Cot+，这是一个新颖的有害模因检测框架。我们首先开发了一个高保真模因到文本管道，将视觉模因转换为详细的文本描述，而不是仅依靠促使或微调多模型模型。该设计将模因解释与模因分类的解释，从而避免了对复杂的原始视觉内容的立即推理，并通过一般的大语言模型（LLMS）实现了资源有效的有害模因检测。在这些文本描述的基础上，我们进一步纳入了有针对性的，可解释的人工制作的指南，以指导模型在零射Cot提示下的推理。因此，该框架可以轻松适应跨平台，区域和随着时间的时间的不同有害性检测标准，从而提供高灵活性和解释性。在七个基准数据集上进行的广泛实验验证了我们框架的有效性，强调了其使用小规模LLMS可解释和低资源有害模因检测的潜力。代码和数据可在以下网址提供：此HTTPS URL。

Title: Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$

Authors: Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.08479
Pdf URL: https://arxiv.org/pdf/2506.08479
Copy Paste: [[2506.08479]] Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$(https://arxiv.org/abs/2506.08479)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain question answering (QA). However, the optimal amount of external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-$k$ retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It does not require model fine-tuning, extra LLM inferences or changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
摘要：检索增强的生成（RAG）和长篇小写语言模型（LCLMS）均应解决开放域问答中LLM的上下文限制（QA）。但是，检索的最佳外部环境仍然是一个空旷的问题：修复检索尺寸的风险要么浪费令牌或省略关键证据。现有的自适应方法（例如自露式和自行车）依赖于迭代LLM提示并在Factoid QA上表现良好，但是在汇总质量质量质量检查中挣扎，其中最佳上下文大小是未知和可变的。我们提出了自适应-K $检索，这是一种简单有效的单通行方法，它根据查询和候选段落之间的相似性分数的分布自适应地选择了段落的数量。它不需要模型进行微调，额外的LLM推断或对现有Referiever-Reader管道的更改。在Factoid和Contregation QA基准测试中，自适应-K $匹配或均优于固定的-K $基线，而使用的代币则比全文输入少10倍，但仍然可以检索70％的相关段落。它提高了五个LCLM和两个嵌入模型的准确性，强调了动态调整上下文大小会导致更有效，准确的质量质量量。
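Illustrative sketch (not from the paper): the abstract says the number of passages is chosen from the distribution of query-passage similarity scores, but it does not spell out the rule. The Python below assumes one simple rule for illustration: sort the scores and cut at the largest drop between neighbours. The function name `adaptive_k`, the `min_k` parameter, and the example scores are hypothetical.

```python
def adaptive_k(scores, min_k=1):
    """Choose k by cutting the sorted score list at its largest gap (assumed rule)."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= min_k:
        return len(ranked)
    # gaps[i] is the drop between the (i+1)-th and (i+2)-th best passages.
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = max(range(min_k - 1, len(gaps)), key=lambda i: gaps[i])
    return cut + 1


# Three passages clearly match the query; the rest trail far behind.
aggregation_scores = [0.82, 0.80, 0.79, 0.41, 0.40, 0.38]
# A factoid-style query where one passage dominates.
factoid_scores = [0.90, 0.30, 0.29]

k_aggregation = adaptive_k(aggregation_scores)
k_factoid = adaptive_k(factoid_scores)
```

A rule of this kind needs only the scores the retriever already produced, which is consistent with the abstract's claim of a single pass with no extra LLM calls.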

Title: Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models

Authors: Huixuan Zhang, Xiaojun Wan
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.08480
Pdf URL: https://arxiv.org/pdf/2506.08480
Copy Paste: [[2506.08480]] Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models(https://arxiv.org/abs/2506.08480)
Keywords: prompt
Abstract: Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.
摘要：文本到图像模型通常很难生成与文本提示完全匹配的图像。先前的研究已广泛研究了文本对图像生成中图像文本比对的评估。但是，现有的评估主要集中于与人类评估的一致，忽略了值得信赖的评估框架的其他关键特性。在这项工作中，我们首先确定了可靠评估应解决的两个关键方面。然后，我们从经验上证明，当前的主流评估框架无法完全满足各种指标和模型的这些属性。最后，我们提出了改善图像文本一致性评估的建议。

Title: Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models

Authors: Sumanth Manduru, Carlotta Domeniconi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08487
Pdf URL: https://arxiv.org/pdf/2506.08487
Copy Paste: [[2506.08487]] Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models(https://arxiv.org/abs/2506.08487)
Keywords: language model, llm, prompt
Abstract: The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.
摘要：快速采用小语言模型（SLM）进行设备和资源受限的部署，使我们对其道德风险的理解超过了。据我们所知，我们对跨越0.5至50亿个参数的指令调整的SLM进行了首次大规模审核 - 在Bert级编码器和旗舰LLMS之间被忽视的“中间层”。我们的评估包括QWEN 2.5，Llama 3.2，Gemma 3和Phi家族的九种开源模型。在零射击提示下，使用烧烤基准测试，我们在模棱两可和歧义的上下文中分析了效用和公平性。该评估揭示了三个关键见解。首先，能力和公平不必是拮抗的：PHI模型在表现最小的偏见时达到了超过90％的F1评分，表明可以实现高效和道德的NLP。其次，社会偏见因建筑：QWEN 2.5模型可能看起来很公平，但这通常反映出空虚的中立性，随机猜测或反复的行为，而不是真正的道德一致性。相比之下，美洲驼3.2模型表现出更强的刻板印象偏见，表明过度自信而不是中立。第三，压缩引入了细微的权衡：4位AWQ量化提高了Llama 3.2-3b的模棱两可的设置中的F1分数，但在PHI-4-MINI中与残疾相关的偏见提高了7个百分点以上。这些见解为在要求公平和效率的应用程序中负责部署SLM提供了实用的指导，尤其是使小型企业和资源约束环境受益。

Title: EtiCor++: Towards Understanding Etiquettical Bias in LLMs

Authors: Ashutosh Dwivedi, Siddhant Shivdutt Singh, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.08488
Pdf URL: https://arxiv.org/pdf/2506.08488
Copy Paste: [[2506.08488]] EtiCor++: Towards Understanding Etiquettical Bias in LLMs(https://arxiv.org/abs/2506.08488)
Keywords: llm
Abstract: In recent years, researchers have started analyzing the cultural sensitivity of LLMs. In this respect, etiquettes have been an active area of research. Etiquettes are region-specific and are an essential part of the culture of a region; hence, it is imperative to make LLMs sensitive to etiquettes. However, more resources are needed for evaluating LLMs' understanding of, and bias with regard to, etiquettes. In this resource paper, we introduce EtiCor++, a corpus of etiquettes worldwide. We introduce different tasks for evaluating LLMs for knowledge about etiquettes across various regions. Further, we introduce various metrics for measuring bias in LLMs. Extensive experimentation with LLMs shows inherent bias towards certain regions.
摘要：近年来，研究人员已经开始分析LLM的文化敏感性。在这方面，礼节一直是研究的积极领域。礼节是特定区域的，是地区文化的重要组成部分。因此，必须使LLM对礼节敏感。但是，在评估LLM的理解和偏见方面，需要有更多资源。在此资源论文中，我们在全球范围内介绍了Eteticor ++。我们介绍了不同的任务，以评估LLM，以了解各个地区的礼节的知识。此外，我们介绍了各种指标，以测量LLM中的偏差。使用LLMS进行广泛的实验表明对某些区域的固有偏见。

Title: Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework

Authors: Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08490
Pdf URL: https://arxiv.org/pdf/2506.08490
Copy Paste: [[2506.08490]] Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework(https://arxiv.org/abs/2506.08490)
Keywords: prompt
Abstract: Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at this https URL.
摘要：意图检测旨在从自然语言输入中确定用户意图，在这种情况下，监督方法在很大程度上依赖于标记的内域（IND）数据并与室外（OOD）意图挣扎，从而限制了其实际适用性。广义意图发现（GID）通过利用未标记的OOD数据来发现新意图而没有其他注释，从而解决了这一点。但是，现有方法仅着眼于无监督数据的聚类，同时忽略了域的适应性。因此，我们从整合旧知识的角度提出了一个由一致性驱动的原型启动框架，用于GID，其中包括一个原型促进框架，用于从外部来源转移旧知识，以及从目标领域学习新知识的层次结构一致性约束。我们进行了广泛的实验，结果表明，我们的方法显着优于所有基线方法，从而实现了最新的结果，这强烈证明了我们方法的有效性和概括。我们的源代码可在此HTTPS URL上公开获得。

Title: DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs

Authors: Arie Cattan, Alon Jacovi, Ori Ram, Jonathan Herzig, Roee Aharoni, Sasha Goldshtein, Eran Ofek, Idan Szpektor, Avi Caciularu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08500
Pdf URL: https://arxiv.org/pdf/2506.08500
Copy Paste: [[2506.08500]] DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs(https://arxiv.org/abs/2506.08500)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains unclear how models should address such discrepancies. In this work, we first propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type. We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. CONFLICTS is the first benchmark that enables tracking progress on how models address a wide range of knowledge conflicts. We conduct extensive experiments on this benchmark, showing that LLMs often struggle to appropriately resolve conflicts between sources. While prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement in future research remains.
摘要：检索增强发电（RAG）是一种使用相关和最新信息增强大型语言模型（LLM）的常用方法。但是，检索到的来源通常可能包含冲突的信息，尚不清楚模型应如何解决此类差异。在这项工作中，我们首先提出了抹布中知识冲突类型的新型分类法，以及每种类型的所需模型行为。然后，我们引入冲突，这是一种高质量的基准测试，并在现实的抹布环境中对冲突类型进行了专家注释。冲突是第一个基准，可以在模型如何解决广泛的知识冲突上跟踪进度。我们对这个基准进行了广泛的实验，表明LLMS通常很难适当地解决来源之间的冲突。在促使LLMS明确理解检索文件中的潜在冲突的同时，显着提高了其反应的质量和适当性，但仍有很大的改进空间可以改善未来的研究。

Title: Efficient Post-Training Refinement of Latent Reasoning in Large Language Models

Authors: Xinyuan Wang, Dongjie Wang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Sixun Dong, Kunpeng Liu, Yanjie Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08552
Pdf URL: https://arxiv.org/pdf/2506.08552
Copy Paste: [[2506.08552]] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models(https://arxiv.org/abs/2506.08552)
Keywords: language model, prompt, chain-of-thought
Abstract: Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from substantial token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model's latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, the framework achieves a 5% accuracy gain on MathQA without additional training.
摘要：推理是大语言模型中语言理解的关键组成部分。虽然经过思考的链条通过明确的中间步骤提高了性能，但它遭受了足够的令牌开销和固定的推理轨迹，从而阻止了逐步精致。潜在推理的最新进展通过直接在模型的潜在空间中完善内部推理过程来解决这些局限性，而无需产生明确的输出。但是，一个关键的挑战仍然存在：如何有效地更新训练后的推理嵌入，以指导模型朝着更准确的解决方案迈进。为了克服这一挑战，我们提出了一个轻巧的训练后框架，使用两种新型策略来完善潜在的推理轨迹：1）对比度推理反馈，该反馈比较了与强和弱基线的推理嵌入，以通过嵌入增强来推断有效的更新说明； 2）残留的嵌入精炼，通过逐步整合当前和历史梯度，从而稳定更新，从而实现快速但受控的收敛性。对五个推理基准进行了广泛的实验和案例研究，以证明拟议框架的有效性。值得注意的是，没有额外的培训，在MATHQA上获得了5 \％的准确性。
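
Illustrative note on the entry above: the abstract describes two ideas, an update direction inferred by comparing against strong and weak baselines, and an update rule that blends current and historical gradients. The sketch below is only one plausible reading of that description, not the paper's method. The function names, the "strong minus weak" direction, and the exponential-moving-average rule with its `step` and `decay` parameters are all assumptions made for illustration.

```python
def contrastive_direction(strong, weak):
    """Hypothetical update direction: point away from the weak baseline
    embedding and toward the strong one (an assumption, not the paper's rule)."""
    return [s - w for s, w in zip(strong, weak)]


def residual_refine(embedding, directions, step=0.1, decay=0.9):
    """Apply a sequence of update directions to an embedding.

    Each step blends the current direction with an exponential moving
    average of earlier ones, so a single noisy direction cannot dominate.
    """
    momentum = [0.0] * len(embedding)
    current = list(embedding)
    for direction in directions:
        # Integrate the new direction with the accumulated history.
        momentum = [decay * m + (1 - decay) * g
                    for m, g in zip(momentum, direction)]
        # Residual update: add a scaled step to the existing embedding.
        current = [c + step * m for c, m in zip(current, momentum)]
    return current


# Usage with toy 2-dimensional embeddings.
direction = contrastive_direction(strong=[1.0, 0.0], weak=[0.0, 0.0])
refined = residual_refine([0.0, 0.0], [direction, direction], step=1.0, decay=0.5)
```

A smaller `decay` makes the update follow the most recent direction more closely, while a larger one weights history more heavily; which regime the paper uses is not stated in the abstract.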

Title: CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling

Authors: Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, Ruishan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08584
Pdf URL: https://arxiv.org/pdf/2506.08584
Copy Paste: [[2506.08584]] CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling(https://arxiv.org/abs/2506.08584)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly proposed for use in mental health support, yet their behavior in realistic counseling scenarios remains largely untested. We introduce CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test LLMs in single-turn counseling. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of responses from GPT-4, LLaMA 3, Gemini, and online human therapists to real patient questions. Each response is rated along six clinically grounded dimensions, with written rationales and span-level annotations. We find that LLMs often outperform online human therapists in perceived quality, but experts frequently flag their outputs for safety concerns such as unauthorized medical advice. Follow-up experiments show that LLM judges consistently overrate model responses and overlook safety issues identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored counseling questions designed to trigger specific model issues. Evaluation across 2,880 responses from eight LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking and improving LLM behavior in high-stakes mental health settings.
摘要：大型语言模型（LLM）越来越多地用于精神健康支持，但在现实的咨询场景中，它们的行为仍然很大程度上未经测试。我们介绍了CounselBench，这是一种由100位精神卫生专业人员开发的大规模基准，以评估和压力测试LLM在单转咨询中。第一个组成部分是顾问贝恩斯 - 顾问，其中包含2,000个专家评估，对GPT-4，Llama 3，Gemini和在线人类治疗师的反应进行了2,000个专家评估。每个响应都沿六个临床扎根的维度进行评分，并具有书面理由和跨度级注释。我们发现，LLM的质量通常优于在线人类治疗师，但专家经常为安全问题（例如未经授权的医疗建议）标记其产出。后续实验表明，LLM法官始终将模型的响应置于人类专家确定的安全问题。为了更直接地探测故障模式，我们构建了CounselBench-ADV，这是一个旨在触发特定模型问题的120个专家咨询问题的对抗数据集。来自八个LLM的2,880个响应的评估揭示了一致的，特定于模型的故障模式。 CounselBench共同建立了一个临床上扎根的框架，用于基准和改善高风险心理健康环境中的LLM行为。

Title: Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models

Authors: Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08593
Pdf URL: https://arxiv.org/pdf/2506.08593
Copy Paste: [[2506.08593]] Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models(https://arxiv.org/abs/2506.08593)
Keywords: language model, llm, prompt
Abstract: Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.
摘要：仇恨言论检测是一项具有社会敏感且固有的主观任务，判断通常会根据个人特征而变化。尽管先前的工作已经检查了社会人口统计学因素如何影响注释，但人格特征对大语言模型（LLMS）的影响仍然很大程度上没有探索。在本文中，我们介绍了关于角色提示在仇恨言语分类中的作用的首次全面研究，重点是基于MBTI的特征。人类注释调查证实，MBTI尺寸显着影响标签行为。将其扩展到LLM，我们提示了四个带有MBTI角色的开源模型，并在三个仇恨语音数据集中评估其输出。我们的分析发现了具有角色驱动的实质性变化，包括与地面真理，彼此之间的分歧和logit级偏见的不一致。这些发现突出了需要在基于LLM的注释工作流中仔细定义角色提示，这对公平和与人类价值观的一致性有影响。

Title: RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

Authors: Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08625
Pdf URL: https://arxiv.org/pdf/2506.08625
Copy Paste: [[2506.08625]] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval(https://arxiv.org/abs/2506.08625)
Keywords: llm
Abstract: Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from an in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE retrieves documents that are not only similar in terms of domain knowledge but also logically more relevant.
摘要：科学推理不仅需要长链推理过程，还需要针对域特异性术语的知识和对更新发现的适应性。为了应对科学推理的这些挑战，我们引入了加薪，这是一个逐步检索的框架，该框架从野外语料库中检索了逻辑上相关的文档。加薪分为三个步骤：问题分解，逻辑查询产生和逻辑检索。我们观察到，在科学推理基准上，一贯提高其他基准。我们分析说，与其他基线不同，请筹集检索文档，这些文档不仅在领域知识方面相似，而且在逻辑上也更相关。

Title: MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models

Authors: Son The Nguyen, Theja Tulabandhula
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08643
Pdf URL: https://arxiv.org/pdf/2506.08643
Copy Paste: [[2506.08643]] MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models(https://arxiv.org/abs/2506.08643)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for both open-ended and structured tasks, yet their inference-time behavior is still largely dictated by heuristic decoding strategies such as greedy search, sampling, or reranking. These methods provide limited control and do not explicitly optimize for task-specific objectives. We introduce MEMETRON, a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem. MEMETRON leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the response space, guided by reward models and contextual operations performed by the LLM itself. This approach enables efficient discovery of high-reward responses without requiring model retraining or gradient access. The framework is modular and generalizes across diverse tasks, requiring only a reward function and lightweight prompt templates. We evaluate our framework on the critical human preference alignment task and demonstrate that it significantly outperforms standard decoding and reranking methods, highlighting its potential to improve alignment without model retraining.
摘要：大型语言模型（LLM）越来越多地用于开放式和结构化任务，但是它们的推理时间行为仍然在很大程度上取决于启发式解码策略，例如贪婪的搜索，采样或重新研究。这些方法提供了有限的控制，并且不能明确地针对特定任务的目标进行优化。我们介绍了Memetron，这是一种任务无关的框架，该框架将LLM解码为离散的黑盒优化问题。 Memetron利用Genetron和Annetron的混合元神经算法来搜索响应空间，并在LLM本身执行的奖励模型和上下文操作的指导下。这种方法可以有效地发现高回复的响应，而无需模型再培训或梯度访问。该框架是模块化的，并且在各种任务之间进行了概括，仅需要奖励功能和轻巧的及时模板。我们在关键的人类偏好一致性任务上评估了我们的框架，并证明它的表现明显优于标准解码和重读方法，从而强调了其在没有模型再培训的情况下改善对齐方式的潜力。
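
Illustrative note on the entry above: the abstract frames decoding as black-box search over responses, guided by a reward function. The sketch below shows a generic elitist genetic search in that spirit. It is not GENETRON or ANNETRON, whose details the abstract does not give. The names `genetic_search`, `toy_reward`, `toy_mutate`, and `toy_crossover` are hypothetical, and the toy operators stand in for the reward model and the LLM-performed edits.

```python
import random


def genetic_search(population, reward, mutate, crossover,
                   generations=5, keep=2, seed=0):
    """Generic elitist genetic search over candidate responses.

    The top `keep` candidates (keep >= 2) survive each generation unchanged,
    so the best reward in the population never decreases.
    """
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=reward, reverse=True)
        elite = pop[:keep]
        children = []
        while len(elite) + len(children) < len(pop):
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        pop = elite + children
    return max(pop, key=reward)


# Toy stand-ins (assumptions): a real system would call a reward model
# and ask the LLM to recombine or rewrite candidates.
def toy_reward(text):
    return -abs(len(text.split()) - 5)  # prefer five-word responses


def toy_crossover(a, b, rng):
    words_a, words_b = a.split(), b.split()
    return " ".join(words_a[: len(words_a) // 2] + words_b[len(words_b) // 2:])


def toy_mutate(text, rng):
    return text + " " + rng.choice(["please", "thanks"])


candidates = ["hi", "hello there", "a much longer reply than is needed here"]
best = genetic_search(candidates, toy_reward, toy_mutate, toy_crossover)
```

Keeping the elite unchanged is a design choice that trades exploration for a monotone guarantee: the returned response is never worse, under the given reward, than the best initial candidate.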

Title: TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

Authors: Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08646
Pdf URL: https://arxiv.org/pdf/2506.08646
Copy Paste: [[2506.08646]] TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning(https://arxiv.org/abs/2506.08646)
Keywords: gpt, llm
Abstract: Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62 percentage points (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data are available at this https URL.
摘要：尽管最近基于LLM的数据合成方法取得了值得称赞的进展，但它们在生成表指令调整数据时仍面临两个局限性。首先，他们无法彻底探索桌面的巨大输入空间理解任务，从而导致数据多样性有限。其次，他们忽略了桌面LLM的理解能力中的弱点，而盲目地追求数据数量的增加，从而导致了次优的数据效率。在本文中，我们介绍了一个渐进型和弱点引导的数据合成框架，该框架量身定制了用于表格指令调整的TableDreamer，以减轻上述问题。具体而言，我们首先将各种表和相关指令综合为种子数据，然后在新确定的弱点数据的指导下对输入空间进行迭代探索，最终充当微调目标LLM的最终培训数据。对10个表格基准测试的广泛实验证明了该框架的有效性，这将Llama3.1-8B教学的平均准确性提高了11.62％（49.07％至60.69％），并使用27K GPT-GPT-4O-4O，并使用27K GPT-4O的同步数据，并超越了态度的纳入图案数据合成基础，从而使用更多培训数据。该代码和数据可在此HTTPS URL上获得

Title: Summarization for Generative Relation Extraction in the Microbiome Domain

Authors: Oumaima El Khettari, Solen Quiniou, Samuel Chaffron
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08647
Pdf URL: https://arxiv.org/pdf/2506.08647
Copy Paste: [[2506.08647]] Summarization for Generative Relation Extraction in the Microbiome Domain(https://arxiv.org/abs/2506.08647)
Keywords: language model, llm
Abstract: We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resource settings.
摘要：我们探讨了针对研究肠道微生物组中相互作用的生成关系提取（RE）管道，这是一个复杂且低资源的生物医学结构域。我们的方法利用大语言模型（LLM）摘要在通过指导调节的生成提取关系之前先精炼上下文。专用语料库的初步结果表明，汇总通过减少噪声和指导模型来改善生成性能。但是，基于BERT的RE方法仍然超过了生成模型。这项正在进行的工作证明了生成方法支持在低资源环境中研究专业领域的潜力。

Title: Brevity is the soul of sustainability: Characterizing LLM response lengths

Authors: Soham Poddar, Paramita Koley, Janardan Misra, Sanjay Podder, Navveen Balani, Niloy Ganguly, Saptarshi Ghosh
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.08686
Pdf URL: https://arxiv.org/pdf/2506.08686
Copy Paste: [[2506.08686]] Brevity is the soul of sustainability: Characterizing LLM response lengths(https://arxiv.org/abs/2506.08686)
Keywords: language model, llm, prompt
Abstract: A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy optimization of 25-60% by reducing the response length while preserving the quality of LLM responses.
摘要：大语言模型（LLM）消耗的很大一部分能量来自其推理过程。因此，开发推理节能方法至关重要。尽管存在一些用于推理优化的技术，但输出压缩仍然相对尚未探索，只有少数初步的努力解决了这一方面。在这项工作中，我们首先在5个数据集中基准了12个仅解码器的LLM，这表明这些模型通常会产生比必要的更长的响应。然后，我们对LLM响应进行了全面的质量评估，正式定义了LLM响应中存在的六个信息类别。我们表明，除了最小的答案外，LLM通常倾向于包括多余或其他信息。为了解决LLM的长期响应问题，我们探讨了几种简单，直观的及时工程策略。经验评估表明，针对长度降低和控制信息内容的适当提示可以通过减少响应长度来维护LLM响应的质量，从而实现25-60 \％之间的显着能量优化。

Title: ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Authors: Ruiran Su, Jiasheng Si, Zhijiang Guo, Janet B. Pierrehumbert
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.08700
Pdf URL: https://arxiv.org/pdf/2506.08700
Copy Paste: [[2506.08700]] ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts(https://arxiv.org/abs/2506.08700)
Keywords: language model
Abstract: Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are key for presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking using expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes structured knowledge graph explanations covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, including both proprietary and open-source systems, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We released our dataset and code alongside the paper.
摘要：科学的事实检查主要集中在文本和表上，忽略了科学图表，这是提供定量证据和统计推理的关键。我们介绍了Climateviz，这是第一个使用专家策划的科学图表进行科学事实检查的大规模基准。气候维兹包含与2,896个可视化相关的49,862个索赔，每个索赔都标记为支持，驳斥或不足的信息。为了提高可解释性，每个示例包括结构化知识图解释，涵盖了趋势，比较和因果关系。我们以零射击和少量设置评估了最新的多模式模型，包括专有和开源系统。结果表明，当前的模型与基于图表的推理相加困难：即使是Gemini 2.5和Internvl 2.5等最佳系统，在仅贴标签的设置中，仅达到76.2％至77.8％的精度，远低于人类绩效（89.3和92.7％）。解释增强的输出可改善某些模型的性能。我们在论文旁边发布了数据集和代码。

Title: ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Large Language Model Preference Optimization

Authors: Hee Suk Yoon, Eunseop Yoon, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08712
Pdf URL: https://arxiv.org/pdf/2506.08712
Copy Paste: [[2506.08712]] ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Large Language Model Preference Optimization(https://arxiv.org/abs/2506.08712)
Keywords: language model, llm
Abstract: We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.
摘要：我们介绍了Confpo，这是一种在大型语言模型（LLMS）中偏爱学习的方法，该方法仅根据培训政策的信心来识别和优化偏好关键令牌，而无需任何辅助模型或计算。与先前的直接比对算法（DAAS）（例如直接偏好优化（DPO））不同，它均匀地调整了所有令牌概率，无论其与偏好的相关性如何，Confpo都集中在最有影响力的代币上。这种有针对性的方法可以通过更有效地使用KL Divergence预算来改善过度优化（即奖励黑客），从而提高对齐质量。与依赖信用代码模型或AI注释者的最新令牌方法相反，对可伸缩性和可靠性提出了担忧，Confpo简单，轻巧且无模型。关于挑战的对齐基准（包括alpacaeval 2和竞技场）的实验结果表明，康斯波始终优于各个LLM的均匀DAA，提供更好的对齐方式，与零额外的计算机开销。

Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Authors: Fariz Ikhwantri, Dusica Marijan
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2506.08713
Pdf URL: https://arxiv.org/pdf/2506.08713
Copy Paste: [[2506.08713]] Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure(https://arxiv.org/abs/2506.08713)
Keywords: language model, llm
Abstract: Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
摘要：确保复杂的系统符合法规通常需要通过索赔辩论的证据框架来检查保证案件的有效性。此过程中的一些挑战包括法律和技术文本的复杂性质，对模型解释的需求以及对保证案例数据的访问有限。我们提出了一种基于自然语言推论（NLI）的合规性检测方法：可解释的合规性检测，并通过对多跳的推理的论证推断（Apaim）。我们将保证案例的索赔权证明结构制定为可解释和可追溯合规性检测的多跳推断。我们通过使用大型语言模型（LLM）生成保证案例数量有限。我们介绍了测量覆盖范围和结构一致性的指标。我们证明了从多跳推理任务中，从GDPR要求作为案例研究中生成的保证案例的有效性。我们的结果突出了基于NLI的方法在自动化监管合规过程中的潜力。

Title: Improved LLM Agents for Financial Document Question Answering

Authors: Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08726
Pdf URL: https://arxiv.org/pdf/2506.08726
Copy Paste: [[2506.08726]] Improved LLM Agents for Financial Document Question Answering(https://arxiv.org/abs/2506.08726)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with a calculator agent that outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
摘要：大型语言模型（LLM）在众多自然语言处理任务上表现出令人印象深刻的功能。但是，LLM仍在为包括表格和文本数据在内的财务文件的数字问题回答。最近的作品表明了给定标签对此任务的评论家（即自校正）的有效性。在此框架的基础上，本文研究了甲骨文标签不可用时，并通过实验表明，在这种情况下，该评论家的表现会恶化，并证明了传统评论家的有效性。考虑到这一点，我们提出了一位改进的评论家，以及计算剂的表现优于先前的最新方法（经营计划），并且更安全。此外，我们研究了代理人如何相互作用，以及这种相互作用如何影响他们的性能。

Title: Towards Secure and Private Language Models for Nuclear Power Plants

Authors: Muhammad Anwar, Mishca de Costa, Issam Hammad, Daniel Lau
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08746
Pdf URL: https://arxiv.org/pdf/2506.08746
Copy Paste: [[2506.08746]] Towards Secure and Private Language Models for Nuclear Power Plants(https://arxiv.org/abs/2506.08746)
Keywords: language model, llm
Abstract: This paper introduces a domain-specific Large Language Model for nuclear applications, built from the publicly accessible Essential CANDU textbook. Drawing on a compact Transformer-based architecture, the model is trained on a single GPU to protect the sensitive data inherent in nuclear operations. Despite relying on a relatively small dataset, it shows encouraging signs of capturing specialized nuclear vocabulary, though the generated text sometimes lacks syntactic coherence. By focusing exclusively on nuclear content, this approach demonstrates the feasibility of in-house LLM solutions that align with rigorous cybersecurity and data confidentiality standards. Early successes in text generation underscore the model's utility for specialized tasks, while also revealing the need for richer corpora, more sophisticated preprocessing, and instruction fine-tuning to enhance domain accuracy. Future directions include extending the dataset to cover diverse nuclear subtopics, refining tokenization to reduce noise, and systematically evaluating the model's readiness for real-world applications in the nuclear domain.
摘要：本文介绍了一种针对核应用的特定领域的大语言模型，该模型是由公共可访问的必需Candu教科书构建的。该模型在基于紧凑的变压器架构上，在单个GPU上进行了训练，以保护核操作中固有的敏感数据。尽管依靠一个相对较小的数据集，但它显示出令人鼓舞的迹象，即捕获专业的核词汇，尽管产生的文本有时缺乏句法连贯性。通过仅专注于核含量，这种方法证明了内部LLM解决方案与严格的网络安全和数据机密性标准的可行性。文本生成的早期成功强调了该模型对专业任务的实用性，同时还揭示了对富裕语料库的需求，更复杂的预处理和指导进行了微调以提高域精度。未来的方向包括扩展数据集以涵盖各种核子主题，完善令牌化以减少噪声，并系统地评估该模型对核域中现实世界应用的准备就绪。

Title: Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data

Authors: Muhammad Anwar, Daniel Lau, Mishca de Costa, Issam Hammad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08750
Pdf URL: https://arxiv.org/pdf/2506.08750
Copy Paste: [[2506.08750]] Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data(https://arxiv.org/abs/2506.08750)
Keywords: language model, llm
Abstract: The nuclear industry possesses a wealth of valuable information locked away in unstructured text data. This data, however, is not readily usable for advanced Large Language Model (LLM) applications that require clean, structured question-answer pairs for tasks like model training, fine-tuning, and evaluation. This paper explores how synthetic data generation can bridge this gap, enabling the development of robust LLMs for the nuclear domain. We discuss the challenges of data scarcity and privacy concerns inherent in the nuclear industry and how synthetic data provides a solution by transforming existing text data into usable Q&A pairs. This approach leverages LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. By unlocking the potential of LLMs in the nuclear industry, synthetic data can pave the way for improved information retrieval, enhanced knowledge sharing, and more informed decision-making in this critical sector.
摘要：核工业拥有大量在非结构化文本数据中锁定的有价值的信息。但是，此数据不容易用于高级大语言模型（LLM）应用程序，这些应用程序需要清理，结构化的问题 - 答案对，以进行模型培训，微调和评估等任务。本文探讨了合成数据生成如何弥合这一差距，从而为核域的强大LLM开发。我们讨论了核工业固有的数据稀缺和隐私问题的挑战，以及合成数据如何通过将现有文本数据转换为可用的问答对来提供解决方案。此方法利用LLM分析文本，提取关键信息，产生相关问题并评估所得合成数据集的质量。通过释放LLM在核工业中的潜力，合成数据可以为改善信息检索，增强知识共享以及在这个关键部门提供更明智的决策铺平道路。

Title: Factors affecting the in-context learning abilities of LLMs for dialogue state tracking

Authors: Pradyoth Hegde, Santosh Kesiraju, Jan Švec, Šimon Sedláček, Bolaji Yusuf, Oldřich Plchot, Deepak K T, Jan Černocký
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08753
Pdf URL: https://arxiv.org/pdf/2506.08753
Copy Paste: [[2506.08753]] Factors affecting the in-context learning abilities of LLMs for dialogue state tracking(https://arxiv.org/abs/2506.08753)
Keywords: llm, prompt
Abstract: This study explores the application of in-context learning (ICL) to the dialogue state tracking (DST) problem and investigates the factors that influence its effectiveness. We use a sentence embedding based k-nearest neighbour method to retrieve the suitable demonstrations for ICL. The selected demonstrations, along with the test samples, are structured within a template as input to the LLM. We then conduct a systematic study to analyse the impact of factors related to demonstration selection and prompt context on DST performance. This work is conducted using the MultiWoZ2.4 dataset and focuses primarily on the OLMo-7B-instruct, Mistral-7B-Instruct-v0.3, and Llama3.2-3B-Instruct models. Our findings provide several useful insights on in-context learning abilities of LLMs for dialogue state tracking.
摘要：这项研究探讨了在对话状态跟踪（DST）问题上的应用中文化学习（ICL）的应用，并研究了影响其有效性的因素。我们使用基于嵌入的k-nearest邻居方法来检索ICL的合适演示。所选的演示以及测试样本是在模板中构成的，作为输入LLM。然后，我们进行了一项系统研究，以分析与演示选择和迅速上下文对DST性能有关的因素的影响。这项工作是使用MultiWoz2.4数据集进行的，主要集中在Olmo-7b-Instruct，Mistral-7B-Instruct-V0.3和Llama3.2-3B-Instruct模型上。我们的发现为对话状态跟踪的LLMS中文字学习能力提供了一些有用的见解。
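
Illustrative note: the abstract above describes retrieving demonstrations with a sentence-embedding k-nearest-neighbour search and placing them in a template together with the test sample. The sketch below shows one way this could look. It is not the authors' implementation: the three-dimensional vectors stand in for real sentence embeddings, and the names `retrieve_demonstrations` and `build_prompt`, the prompt template, and the slot-value strings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_demonstrations(query_vec, pool, k=2):
    """Return the k pool items whose embeddings are closest to query_vec.

    pool is a list of (embedding, dialogue_text, state_label) tuples.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return ranked[:k]


def build_prompt(demos, test_dialogue):
    """Fill a (hypothetical) template: demonstrations first, test sample last."""
    parts = [f"Dialogue: {text}\nState: {state}" for _, text, state in demos]
    parts.append(f"Dialogue: {test_dialogue}\nState:")
    return "\n\n".join(parts)


# Toy embeddings; a real system would compute these with a sentence encoder.
pool = [
    ([1.0, 0.0, 0.0], "I need a cheap hotel.", "hotel-pricerange=cheap"),
    ([0.0, 1.0, 0.0], "Book a taxi to the station.", "taxi-destination=station"),
    ([0.9, 0.1, 0.0], "Find me an expensive hotel.", "hotel-pricerange=expensive"),
]
demos = retrieve_demonstrations([1.0, 0.05, 0.0], pool, k=2)
prompt = build_prompt(demos, "Any moderately priced hotels?")
print(prompt)
```

With these toy vectors the two hotel dialogues are retrieved and the taxi dialogue is left out, so the prompt ends with the unanswered test sample for the LLM to complete.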

Title: Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL

Authors: Mishca de Costa, Muhammad Anwar, Dave Mercier, Mark Randall, Issam Hammad
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08757
Pdf URL: https://arxiv.org/pdf/2506.08757
Copy Paste: [[2506.08757]] Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL(https://arxiv.org/abs/2506.08757)
Keywords: language model, llm
Abstract: Retrieving operational data from nuclear power plants requires exceptional accuracy and transparency due to the criticality of the decisions it supports. Traditionally, natural language to SQL (NL-to-SQL) approaches have been explored for querying such data. While NL-to-SQL promises ease of use, it poses significant risks: end-users cannot easily validate generated SQL queries, and legacy nuclear plant databases -- often complex and poorly structured -- complicate query generation due to decades of incremental modifications. These challenges increase the likelihood of inaccuracies and reduce trust in the approach. In this work, we propose an alternative paradigm: leveraging function-calling large language models (LLMs) to address these challenges. Instead of directly generating SQL queries, we define a set of pre-approved, purpose-specific functions representing common use cases. Queries are processed by invoking these functions, which encapsulate validated SQL logic. This hybrid approach mitigates the risks associated with direct NL-to-SQL translations by ensuring that SQL queries are reviewed and optimized by experts before deployment. While this strategy introduces the upfront cost of developing and maintaining the function library, we demonstrate how NL-to-SQL tools can assist in the initial generation of function code, allowing experts to focus on validation rather than creation. Our study includes a performance comparison between direct NL-to-SQL generation and the proposed function-based approach, highlighting improvements in accuracy and maintainability. This work underscores the importance of balancing user accessibility with operational safety and provides a novel, actionable framework for robust data retrieval in critical systems.
摘要：从核电站检索运营数据需要出色的准确性和透明度，这是由于其支持的决策的关键性。传统上，已经探索了用于查询此类数据的自然语言（NL-TO-SQL）方法。尽管NL到SQL有望易于使用，但它带来了重大的风险：最终用户无法轻易验证生成的SQL查询，而传统的核电站数据库（通常是复杂且结构不足 - 由于数十年的增量修改而引起的查询产生并复杂化。这些挑战增加了不准确的可能性，并减少了对方法的信任。在这项工作中，我们提出了一种替代范式：利用功能呼叫大语模型（LLMS）来应对这些挑战。我们不是直接生成SQL查询，而是定义一组代表常见用例的预先批准的特定于目的的功能。查询是通过调用这些功能来处理的，这些功能封装了经过验证的SQL逻辑。这种混合方法通过确保在部署前由专家对SQL查询进行审查和优化，从而减轻与直接NL到SQL翻译相关的风险。尽管此策略介绍了开发和维护功能库的前期成本，但我们演示了NL-to-SQL工具如何有助于初始生成功能代码，从而使专家可以专注于验证而不是创建。我们的研究包括直接NL到SQL生成与提出的基于功能的方法之间的性能比较，强调了准确性和可维护性的提高。这项工作强调了平衡用户可访问性与操作安全性的重要性，并为关键系统中的可靠数据检索提供了一个新颖的可行框架。
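
Illustrative note: the abstract above proposes routing queries through pre-approved functions that encapsulate validated SQL, rather than letting the model write SQL directly. The sketch below shows the general shape of such a dispatcher, under stated assumptions. The table schema, the function names (`average_reading`, `latest_reading`), and the call format passed to `dispatch` are hypothetical and are not taken from the paper; the model's function-call output is represented as a plain dictionary.

```python
import sqlite3

# In-memory stand-in for a plant database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("pump_a", 1, 10.0), ("pump_a", 2, 14.0), ("pump_b", 1, 7.5)],
)


def average_reading(sensor):
    """Expert-reviewed query: mean value for one sensor."""
    row = conn.execute(
        "SELECT AVG(value) FROM readings WHERE sensor = ?", (sensor,)
    ).fetchone()
    return row[0]


def latest_reading(sensor):
    """Expert-reviewed query: most recent value for one sensor."""
    row = conn.execute(
        "SELECT value FROM readings WHERE sensor = ? ORDER BY ts DESC LIMIT 1",
        (sensor,),
    ).fetchone()
    return row[0] if row else None


# Only functions listed here can ever touch the database.
APPROVED = {"average_reading": average_reading, "latest_reading": latest_reading}


def dispatch(call):
    """Execute a model-proposed call of the form {'name': ..., 'arguments': {...}}."""
    name = call["name"]
    if name not in APPROVED:
        raise ValueError(f"function not approved: {name}")
    return APPROVED[name](**call["arguments"])


print(dispatch({"name": "average_reading", "arguments": {"sensor": "pump_a"}}))
```

The design point is that the model only chooses a function name and arguments; the SQL text is fixed in advance and passed parameters are bound with placeholders, so nothing the model emits is executed as SQL.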

Title: AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

Authors: Ahmed Hasanaath, Aisha Alansari, Ahmed Ashraf, Chafik Salmane, Hamzah Luqman, Saad Ezzini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08768
Pdf URL: https://arxiv.org/pdf/2506.08768
Copy Paste: [[2506.08768]] AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP(https://arxiv.org/abs/2506.08768)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at this https URL
摘要：大型语言模型（LLMS）在推理能力和一般自然语言处理（NLP）任务方面表现出了显着的进展，但是它们在阿拉伯数据上的表现，其特征是丰富的形态，多样的方言和复杂的脚本，但仍未得到充实。本文对多种以推理为重点的LLM进行了全面的基准测试研究，特别强调了新引入的DeepSeek模型，其中包括15个阿拉伯语NLP任务。我们尝试各种策略，包括零拍，很少射击和微调。这使我们能够系统地评估涵盖一系列应用程序的数据集上的性能，以检查其在不同级别的复杂性下的语言推理能力。我们的实验揭示了几个关键发现。首先，仔细选择三个封闭式示例，在分类任务 - 提高情感分析中的平均升高从35.3％到87.5％，并从56.1％提高到87.0％。其次，以推理为重点的DeepSeek体系结构在零拍设置中的复杂推理任务上平均超过了强大的GPT O4-Mini基线。第三，与模型量表的同等增加相比，基于洛拉的微型调整在F1和BLEU中的额外高达8点。该代码可在此HTTPS URL上找到

Title: The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation

Authors: Francisco Vargas, Alejandro González Coene, Gaston Escalante, Exequiel Lobón, Manuel Pulido
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08827
Pdf URL: https://arxiv.org/pdf/2506.08827
Copy Paste: [[2506.08827]] The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation(https://arxiv.org/abs/2506.08827)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: The extraction of information about traffic accidents from legal documents is crucial for quantifying insurance company costs. Extracting entities such as percentages of physical and/or psychological disability and the involved compensation amounts is a challenging process, even for experts, due to the subtle arguments and reasoning in the court decision. A two-step procedure is proposed: first, segmenting the document and identifying the most relevant segments, and then extracting the entities. For text segmentation, two methodologies are compared: a classic method based on regular expressions and a second approach that divides the document into blocks of n tokens, which are then vectorized using multilingual models for semantic searches (text-embedding-ada-002/MiniLM-L12-v2). Subsequently, large language models (LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to the selected segments for entity extraction. For the LLaMA models, fine-tuning is performed using LoRA. LLaMA-2 7b, even with zero temperature, shows a significant number of hallucinations in extractions, which are an important contention point for named entity extraction. This work shows that these hallucinations are substantially reduced after finetuning the model. The performance of the methodology based on segment vectorization and subsequent use of LLMs significantly surpasses the classic method, which achieves an accuracy of 39.5%. Among open-source models, LLaMA-2 70B with finetuning achieves the highest accuracy (79.4%), surpassing its base version (61.7%). Notably, the base LLaMA-3 8B model already performs comparably to the finetuned LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model development. Meanwhile, GPT-4 Turbo achieves the highest accuracy at 86.1%.
摘要：从法律文件中提取有关交通事故的信息对于量化保险公司费用至关重要。提取诸如身体和/或心理残疾百分比以及所涉及的薪酬金额之类的实体，即使对于专家来说，由于法院判决中的微妙论点和推理，也是一个具有挑战性的过程。提出了一个两步的程序：首先，将识别最相关段的文档分割，然后提取实体。对于文本进行分割，比较了两种方法：一种基于正则表达式的经典方法和一种将文档划分为n ttokens块的方法，然后使用多语言模型进行语义搜索（文本 - Embedding-ADA-002/minilm-l12-v2）进行矢量化。随后，应用大型语言模型（Llama-2 7b，70b，Llama-3 8b和GPT-4 Turbo），并促使所选段以进行实体提取。对于美洲驼模型，使用Lora进行微调。 Llama-2 7b即使在零温度下，在萃取中也显示出大量的幻觉，这是指定实体提取的重要争议点。这项工作表明，对模型进行填充后，这些幻觉大大减少了。基于片段矢量化和随后使用LLM的方法的性能显着超过了达到39.5％精度的经典方法。在开源型号中，具有FINETUNINing的Llama-2 70B的精度最高79.4％，超过其基本版本61.7％。值得注意的是，基本的Llama-3 8b模型已经与易元的Llama-2 70b模型相当，达到76.6％，突出了模型开发的快速进步。同时，GPT-4 Turbo的准确性最高，为86.1％。
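
Illustrative note: the abstract above compares two segmentation methods, one based on regular expressions and one that splits the document into blocks of n tokens before vectorizing them. The sketch below contrasts the two on a toy sentence. It is only an assumption-laden illustration: tokens are approximated by whitespace splitting, the percentage pattern is a hypothetical example of a regex rule, and the embedding and semantic-search step is omitted entirely.

```python
import re


def split_into_blocks(text, n):
    """Split text into consecutive blocks of at most n whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]


# Hypothetical rule for the regex baseline: a number followed by a percent sign.
PERCENT = re.compile(r"\d+(?:[.,]\d+)?\s*%")


def blocks_matching_regex(blocks):
    """Keep only the blocks in which the percentage pattern occurs."""
    return [block for block in blocks if PERCENT.search(block)]


text = "The court found a disability of 35 % and awarded compensation"
blocks = split_into_blocks(text, 4)
print(blocks)
print(blocks_matching_regex(blocks))
```

In the block-based approach described in the abstract, each block would instead be embedded and ranked by semantic similarity to a query, which does not depend on the exact surface form that a regex needs.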

Title: AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

Authors: Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08885
Pdf URL: https://arxiv.org/pdf/2506.08885
Copy Paste: [[2506.08885]] AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)(https://arxiv.org/abs/2506.08885)
Keywords: llm, prompt
Abstract: Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at this https URL.
摘要：对LLM的对抗威胁的升级比当前防御能力适应的速度快。我们在对齐中暴露了一个关键的几何盲点：对抗性提示剥削潜在的伪装，危险地嵌入到安全表示歧管的同时，同时编码不安全的意图，从而避免了直接偏好优化（DPO）的表面水平防御（DPO），这些防御能保持到潜在的几何形状。我们介绍了Alkali，这是第一个严格策划的对抗基准和最全面的日期，涵盖了三个宏观类别，六个亚型和15个攻击家庭的9,000个提示。对21个领先LLM的评估显示，开放式和封闭源模型的攻击成功率（ASRS）令人震惊，暴露了潜在的脆弱性，我们称其为潜在伪装，这是一个结构性的盲点，在该盲点中，对抗性完成模仿了安全的潜在几何学。为了减轻这种脆弱性，我们引入了宽限期 - 几何表示对比度增强，一个对齐框架耦合偏好学习与潜在空间正则化。 Grace执行了两个限制：安全和对抗性完成之间的潜在分离，以及不安全和越狱行为之间的对抗凝聚力。这些在层次合并的嵌入方式上运行，以学习的注意力剖面，重塑内部几何形状而不修改基本模型，并实现高达39％的ASR降低。此外，我们介绍了AVQI，这是一种几何意识到的指标，该指标可以通过群集分离和紧凑度量化潜在的对准故障。 AVQI揭示了何时不安全完成的何时模仿安全的几何形状，从而为内部编码安全性的模型提供了原则上的镜头。我们在此HTTPS URL上公开提供代码。

Title: PlantBert: An Open Source Language Model for Plant Science

Authors: Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08897
Pdf URL: https://arxiv.org/pdf/2506.08897
Copy Paste: [[2506.08897]] PlantBert: An Open Source Language Model for Plant Science(https://arxiv.org/abs/2506.08897)
Keywords: language model
Abstract: The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantBert, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture, known for its disentangled attention and robust contextual encoding, PlantBert is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantBert to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantBert exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantBert bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
摘要：基于变压器的语言模型的快速发展促进了生物医学和临床自然语言处理方面的突破。然而，这种适应领域的工具的植物科学仍然明显乏味。在这项工作中，我们提出了Plantbert，这是一种高性能的开源语言模型，专门针对从植物压力响应文献中提取结构化知识而定制。基于Deberta建筑，以其散布的关注和鲁棒的上下文编码 - 宾夕法尼亚植物的著名建筑，对精心策划的专家注销摘要的语料库进行了微调，主要关注扁豆（Lens culinaris）对多样化的非生物和生物胁迫的反应。我们的方法将基于变压器的建模与规则增强的语言后处理和本体论实体的归一化结合在一起，使Plantbert能够以精确和语义忠诚度捕获具有生物学意义的关系。使用与作物本体相一致的层次模式对基础语料库进行注释，其中包括分子，生理，生化和植物适应的农艺维度。 Plantbert在实体类型上具有强大的概括能力，并证明了在低资源科学领域适应稳健领域的可行性。通过为高分辨率实体识别提供可扩展且可重复的框架，Plantbert弥合了农业NLP的关键差距，并为植物基因组学，现象学和农艺知识发现中的智能，数据驱动的系统铺平了道路。我们的模型公开发布，以促进计算植物科学中的透明度和加速跨学科创新。

Title: From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis

Authors: Elias Horner, Cristinel Mateis, Guido Governatori, Agata Ciabattoni
Subjects: cs.CL, cs.AI, cs.CY, cs.LO
Abstract URL: https://arxiv.org/abs/2506.08899
Pdf URL: https://arxiv.org/pdf/2506.08899
Copy Paste: [[2506.08899]] From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis(https://arxiv.org/abs/2506.08899)
Keywords: language model, llm, prompt
Abstract: We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs - particularly when prompted effectively - can significantly contribute to scalable legal informatics.
摘要：我们为使用大语言模型（LLMS）对法律文本进行自动语义分析的新方法进行了新的方法，将其针对不良的道词逻辑（DDL）的形式表示。我们提出了一条结构化的管道，将复杂的规范语言置于原子片段，提取道规规则，并将其评估为句法和语义连贯性。我们的方法在各种LLM配置中进行了评估，包括及时的工程策略，微调模型和多阶段管道，重点介绍澳大利亚电信消费者保护法规的法律规范。经验结果表明，机器生成和专家制作的形式化之间有希望的一致性，表明LLMS（尤其是在有效提示时）可以显着有助于可扩展的法律信息学。

Title: Dialect Normalization using Large Language Models and Morphological Rules

Authors: Antonios Dimakis (1 and 2), John Pavlopoulos (1 and 3), Antonios Anastasopoulos (1 and 4) ((1) Archimedes, Athena Research Center, Greece, (2) Department of Informatics and Telecommunications, NKUA, (3) Department of Informatics, Athens University of Economics and Business, Greece, (4) Department of Computer Science, George Mason University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08907
Pdf URL: https://arxiv.org/pdf/2506.08907
Copy Paste: [[2506.08907]] Dialect Normalization using Large Language Models and Morphological Rules(https://arxiv.org/abs/2506.08907)
Keywords: language model, llm, prompt
Abstract: Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it on a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.
摘要：自然语言理解系统与低资源语言（包括许多高资源语言的方言）斗争。方言到标准的归一化试图通过转换方言文本来解决此问题，以便在下游标准语言工具可以使用它。在这项研究中，我们通过引入一种新的归一化方法来解决这项任务，该方法将基于规则的语言知情转换和大型语言模型（LLMS）与有针对性的少弹性提示相结合，而无需任何并行数据。我们实施了希腊方言的方法，并将其应用于区域谚语数据集上，并使用人类注释者评估输出。然后，我们使用该数据集进行下游实验，发现有关这些谚语的先前结果仅依赖于浅表语言信息，包括拼字法，而仍然可以通过其余的语义进行新的观察结果。

Title: PropMEND: Hypernetworks for Knowledge Propagation in LLMs

Authors: Zeyu Leo Liu, Greg Durrett, Eunsol Choi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08920
Pdf URL: https://arxiv.org/pdf/2506.08920
Copy Paste: [[2506.08920]] PropMEND: Hypernetworks for Knowledge Propagation in LLMs(https://arxiv.org/abs/2506.08920)
Keywords: language model, llm
Abstract: Knowledge editing techniques for large language models (LLMs) can inject knowledge that is later reproducible verbatim, but they fall short on propagating that knowledge: models cannot answer questions that require reasoning with the injected knowledge. We present a hypernetwork-based approach for knowledge propagation, named PropMEND, where we meta-learn how to modify gradients of a language modeling loss to encourage injected information to propagate. Our approach extends the meta-objective of MEND [29] so that gradient updates on knowledge are transformed to enable answering multi-hop questions involving that knowledge. We show improved performance on the RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We further introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork, testing knowledge propagation along relations and entities unseen during hypernetwork training. PropMEND still outperforms existing approaches in unseen entity-relation pairs, yet the performance gap decreases substantially, suggesting future work in propagating knowledge to a wide range of relations.
摘要：大型语言模型（LLMS）的知识编辑技术可以注入后来逐字化的知识，但它们在传播这些知识方面缺乏：模型无法回答需要用注入知识推理的问题。我们提出了一种基于超网络的知识传播方法，名为Propmend，在这里我们如何修改语言建模损失的梯度以鼓励注入的信息传播。我们的方法扩展了修补的元观察者[29]，因此知识的梯度更新会被转换为启用涉及该知识的多跳问题。我们在Rippleedit数据集上显示出改进的性能，在挑战多跳的问题上显示了几乎2倍的准确性，其答案在注入的事实中未明确说明。我们进一步介绍了一个新的数据集，即控制的Rippleedit，以评估我们的超网络的概括，并在超网络培训期间未见的关系和实体测试知识传播。在看不见的实体关系对中，预言仍然优于现有方法，但性能差距大大降低，这表明将未来的工作传播到广泛的关系中。

Title: Can A Gamer Train A Mathematical Reasoning Model?

Authors: Andrew Shin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08935
Pdf URL: https://arxiv.org/pdf/2506.08935
Copy Paste: [[2506.08935]] Can A Gamer Train A Mathematical Reasoning Model?(https://arxiv.org/abs/2506.08935)
Keywords: language model, llm
Abstract: While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B parameter mathematical reasoning model on RTX 3080 Ti of 16GB memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. this https URL.
摘要：尽管大型语言模型（LLM）在包括数学推理在内的各种任务中取得了出色的表现，但其发展通常需要过度的计算资源。最近的进步降低了能力培训模型的成本，但即使这些方法都依赖于高端硬件群集。在本文中，我们证明了单个平均游戏GPU可以通过整合增强学习和内存优化技术来训练一个固体的数学推理模型。具体而言，我们在RTX 3080 Ti上训练1.5B参数数学推理模型，该模型在数学推理基准测试基准上比在资源受限的环境中更大的模型在数学推理基准上实现了可比性或更好的性能。我们的结果挑战了最先进的数学推理需要大规模的基础设施，使获得高性能AI研究的范围。此HTTPS URL。

Title: FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

Authors: Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08938
Pdf URL: https://arxiv.org/pdf/2506.08938
Copy Paste: [[2506.08938]] FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation(https://arxiv.org/abs/2506.08938)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at this http URL
摘要：通过检索系统增强的大型语言模型（LLM）在处理知识密集型任务方面具有巨大的潜力。但是，这些模型通常会在不忠实的问题上挣扎，产生忽略检索到上下文或与LLM的参数知识不一致的输出。在知识冲突的情况下，此问题尤其严重，因为检索到的上下文与模型的参数知识冲突。尽管现有的忠实的破布方法通过精心设计的提示或修改的解码策略来遵守严格的上下文，但我们的分析揭示了一个关键的局限性：它们通过强行抑制模型的参数知识来实现忠诚，从而破坏了模型的内部知识结构并增加了误解上下文的风险。为此，本文提出了Faithfulrag，这是一个新颖的框架，通过明确建模模型的参数知识和检索到的上下文来解决知识冲突。具体而言，Faithfulrag在事实层面上确定了相互矛盾的知识，并设计了一个自我思考的过程，使LLM可以在产生响应之前推理和整合矛盾的事实。广泛的实验表明，我们的方法优于最先进的方法。该代码可在https：//此http url上找到

Title: Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions

Authors: Clara Lachenmaier, Judith Sieker, Sina Zarrieß
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08952
Pdf URL: https://arxiv.org/pdf/2506.08952
Copy Paste: [[2506.08952]] Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions(https://arxiv.org/abs/2506.08952)
Keywords: language model, llm
Abstract: Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other's beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don't) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine the ability of LLMs to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs' ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.
摘要：人类之间的交流依赖于对话的基础，即使对话者没有完美的知识，并且必须解决彼此的信念中的差异，从而使对话者能够达到相互的理解。本文研究了大型语言模型（LLM）在没有知识的情况下如何管理共同点，重点是在政治领域中的事实，而错误信息的风险和接地失败很高。我们研究了LLM回答直接知识问题和加载误解的问题的能力。我们评估已加载的问题是否导致LLM与他们的知识水平和政治偏见有关。我们的发现凸显了LLMS参与基础和拒绝虚假用户信念的能力中的重大挑战，从而引起了人们对减轻政治话语中错误信息的作用的担忧。

Title: Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Authors: Marek Kadlčík, Michal Štefánik, Timothee Mickus, Michal Spiegel, Josef Kuchař
Subjects: cs.CL, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2506.08966
Pdf URL: https://arxiv.org/pdf/2506.08966
Copy Paste: [[2506.08966]] Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers(https://arxiv.org/abs/2506.08966)
Keywords: language model
Abstract: Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the embeddings' precision, as judged by our probe's accuracy, explains a large portion of LMs' errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
摘要：审计的语言模型（LMS）容易出现算术错误。现有的工作显示在模型表示的数值方面的成功有限，这表明这些错误可以归因于代表精确数量的分布嵌入的固有不可靠性。但是，我们观察到以前的探测方法不足以满足具有正弦模式的学习数字嵌入的新兴结构。作为响应，我们提出了一种新型的探测技术，该技术在一系列开源LMS中以接近完美的精度从输入嵌入中解码数值。这证明，在唯一的预训练之后，LMS表示具有显着精度的数字。最后，我们发现，根据探测器的准确性判断的嵌入精确性解释了LM在基本算术中的大部分错误，并表明将嵌入与我们的探测器发现的模式对齐可以减轻这些错误。

Title: Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Authors: Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08972
Pdf URL: https://arxiv.org/pdf/2506.08972
Copy Paste: [[2506.08972]] Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System(https://arxiv.org/abs/2506.08972)
Keywords: language model, agent
Abstract: Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks -- such as shot-chain execution tasks and single-screen grounding tasks -- while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS achieves a 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly increasing inference overhead. The demo video, dataset, and code are available on the project page at this https URL.
摘要：已经开发了由多模式大型语言模型提供动力的自主剂，以促进移动设备上的任务执行。但是，先前的工作主要集中在原子任务上（例如shot链执行任务和单屏接地任务），同时忽略了对组成任务的概括，对于现实世界应用来说是必不可少的。这项工作介绍了UI-Nexus，这是一种综合基准，旨在评估三类组成操作的移动代理：简单串联，上下文过渡和深水潜水。 UI-Nexus支持20个完全可控的本地公用事业应用环境中的交互式评估以及30个在线中文和英语服务应用程序。它包括100个交互式任务模板，平均最佳步骤计数为14.05。一系列具有代理工作流程或代理AS-A模型的移动药物的实验结果表明，UI-Nexus提出了重大挑战。具体而言，现有的代理通常努力平衡性能和效率，表现出代表性的故障模式，例如执行不足，执行过度和注意力漂移，从而导致可见的原子与复合概括差距。受这些发现的启发，我们提出了Agent-Nexus，这是一个轻巧有效的调度系统，以解决组成的移动任务。代理 - 纳克斯通过将长马利琴任务动态分解为一系列独立的原子子任务，从而推断了现有移动代理的能力。 Agent-Nexus在UI-Nexus基准中的组成操作任务上，现有移动代理的任务成功率提高了24％至40％，而无需显着牺牲推理开销。该https URL的项目页面上可用演示视频，数据集和代码。

Title: Learning to Reason Across Parallel Samples for LLM Reasoning

Authors: Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09014
Pdf URL: https://arxiv.org/pdf/2506.09014
Copy Paste: [[2506.09014]] Learning to Reason Across Parallel Samples for LLM Reasoning(https://arxiv.org/abs/2506.09014)
Keywords: language model, llm
Abstract: Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., either through majority voting or using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such multiple-sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows a promising generalization ability across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLMs that analyze and aggregate sampled answers, our approach can work with the outputs from premier black box models easily and efficiently.
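Summary: Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (for example, through majority voting or by using verifiers to rank the answers), one can obtain consistent performance gains in math domains. This paper proposes a new way to leverage such multiple-sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, and we optimize it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. The approach also shows promising generalization across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLMs that analyze and aggregate sampled answers, the approach can work with the outputs of premier black box models easily and efficiently.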
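
The two heuristic aggregation baselines named in the abstract above (majority voting and verifier-based ranking) can be sketched as follows. This is a minimal illustration, not the paper's code and not the SSA method itself: the function names `majority_vote` and `best_of_n` are hypothetical, and the verifier scores are assumed to be supplied by some external scoring model.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-sampled one."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in sampling order so that ties are broken deterministically.
    for answer in answers:
        if counts[answer] == best:
            return answer


def best_of_n(answers, scores):
    """Verifier-style re-ranking: return the answer with the highest score.

    `scores` is assumed to come from an external verifier or reward model
    (hypothetical here); one score per sampled answer.
    """
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and aligned")
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    samples = ["42", "41", "42", "40"]
    print(majority_vote(samples))
    print(best_of_n(samples, [0.2, 0.9, 0.3, 0.1]))
```

Both baselines look only at final answers or scalar scores, whereas the abstract describes SSA as reading the concatenated samples themselves before producing an answer.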

Title: Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features

Authors: Hakyung Sung, Karla Csuros, Min-Chang Sung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09021
Pdf URL: https://arxiv.org/pdf/2506.09021
Copy Paste: [[2506.09021]] Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features(https://arxiv.org/abs/2506.09021)
Keywords: gpt, llm, chat
Abstract: This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.
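Summary: This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second-language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). The findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading takes a more generative approach, extensively reworking vocabulary and sentence structures, for example by employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.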

Title: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Authors: Haozhen Zhang, Tao Feng, Jiaxuan You
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09033
Pdf URL: https://arxiv.org/pdf/2506.09033
Copy Paste: [[2506.09033]] Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning(https://arxiv.org/abs/2506.09033)
Keywords: language model, llm
Abstract: The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for performance and cost trade-off optimization, opening a pathway toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at this https URL.
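Summary: The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their ability to handle complex tasks that require the complementary strengths of multiple LLMs. This paper presents Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, using its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, the authors employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the trade-off between performance and cost, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, which enables strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at this https URL.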
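
The abstract above describes a rule-based reward built from three parts: a format reward, a final outcome reward, and a cost reward. One way such a reward could be combined is sketched below. Everything here is an assumption made for illustration: the function name `router_reward`, the weight `alpha`, the normalization by `max_cost`, and the fixed penalty for malformed output are hypothetical and are not taken from the paper.

```python
def router_reward(format_ok, answer_correct, cost, max_cost, alpha=0.1):
    """Toy rule-based reward: format gate, outcome term, cost penalty.

    Hypothetical combination for illustration only; the weights and
    the gating rule are assumptions, not Router-R1's actual reward.
    """
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    if not format_ok:
        # Malformed trajectories get a fixed penalty regardless of outcome.
        return -1.0
    outcome = 1.0 if answer_correct else 0.0
    # Clamp the normalized cost to [0, 1] so the penalty stays bounded.
    normalized_cost = min(max(cost / max_cost, 0.0), 1.0)
    return outcome - alpha * normalized_cost


if __name__ == "__main__":
    # A correct answer obtained cheaply scores higher than one obtained expensively.
    print(router_reward(True, True, cost=1.0, max_cost=10.0))
    print(router_reward(True, True, cost=9.0, max_cost=10.0))
```

In this sketch, raising `alpha` pushes a learned policy toward cheaper model calls at the expense of accuracy, which is the performance-cost trade-off the abstract refers to.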

Title: Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Authors: Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09047
Pdf URL: https://arxiv.org/pdf/2506.09047
Copy Paste: [[2506.09047]] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs(https://arxiv.org/abs/2506.09047)
Keywords: language model
Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
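Summary: Vision-Language models (VLMs) show impressive abilities to answer questions about visual inputs (e.g., counting objects in an image), yet they achieve higher accuracy when performing an analogous task on text (e.g., counting words in a text). The authors investigate this accuracy gap by identifying and comparing the circuits, that is, the task-specific computational sub-graphs, in the different modalities. They show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in how modality-specific data positions (an image or a text sequence) are processed. Zooming in on the image data representations, they observe that these become aligned with the higher-performing analogous textual representations only in later layers, too late in processing to effectively influence subsequent positions. To overcome this, they patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities on average. The analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.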