2025-08-07

Title: How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion

Authors: Agrima Seth, Monojit Choudhary, Sunayana Sitaram, Kentaro Toyama, Aditya Vashistha, Kalika Bali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03712
Pdf URL: https://arxiv.org/pdf/2508.03712
Copy Paste: [[2508.03712]] How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion(https://arxiv.org/abs/2508.03712)
Keywords: language model, gpt, llm, prompt
Abstract: Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and "stickiness" of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. Dataset and Codebook: this https URL
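One way to picture the comparison described in this abstract is a per-group ratio of observed share to reference share. The sketch below is illustrative only: the function name, the group labels, and all numbers are hypothetical placeholders, not values from the paper or from census data.

```python
from collections import Counter


def representation_ratios(labels, reference_share):
    """Return observed_share / reference_share for each reference group.

    labels: one group label per generated story (hypothetical annotation).
    reference_share: mapping of group -> population share (placeholder values).
    A ratio above 1 means the group appears more often than its reference share.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {
        group: (counts.get(group, 0) / total) / share
        for group, share in reference_share.items()
    }


# Placeholder data: 9 of 10 stories feature group_a, 1 features group_b.
story_labels = ["group_a"] * 9 + ["group_b"]
reference = {"group_a": 0.75, "group_b": 0.25}
ratios = representation_ratios(story_labels, reference)
print(ratios)
```

With these placeholder inputs, group_a comes out overrepresented (ratio of about 1.2) and group_b underrepresented (about 0.4). The paper's own metric may differ.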

Title: FeynTune: Large Language Models for High-Energy Theory

Authors: Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis
Subjects: cs.CL, cs.LG, hep-th
Abstract URL: https://arxiv.org/abs/2508.03716
Pdf URL: https://arxiv.org/pdf/2508.03716
Copy Paste: [[2508.03716]] FeynTune: Large Language Models for High-Energy Theory(https://arxiv.org/abs/2508.03716)
Keywords: language model, gpt, llm, chat
Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.

Title: Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

Authors: Abhay Vijayvargia, Ajay Nagpal, Kundeshwar Pundalik, Atharva Savarkar, Smita Gautam, Pankaj Singh, Rohit Saluja, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03719
Pdf URL: https://arxiv.org/pdf/2508.03719
Copy Paste: [[2508.03719]] Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering(https://arxiv.org/abs/2508.03719)
Keywords: chat, retrieval-augmented generation
Abstract: Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. To address this gap in accessibility, this paper presents a novel AI-powered agricultural chatbot, Krishi Sathi, designed to support Indian farmers by providing personalized, easy-to-understand answers to their queries through both text and speech. The system's intelligence stems from an IFT model, subsequently refined through fine-tuning on Indian agricultural knowledge across three curated datasets. Unlike traditional chatbots that respond to one-off questions, Krishi Sathi follows a structured, multi-turn conversation flow to gradually collect the necessary details from the farmer, ensuring the query is fully understood before generating a response. Once the intent and context are extracted, the system performs Retrieval-Augmented Generation (RAG) by first fetching information from a curated agricultural database and then generating a tailored response using the IFT model. The chatbot supports both English and Hindi languages, with speech input and output features (via ASR and TTS) to make it accessible for users with low literacy or limited digital skills. This work demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. This approach yielded strong results, with the system achieving a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions.

Title: Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Authors: Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03726
Pdf URL: https://arxiv.org/pdf/2508.03726
Copy Paste: [[2508.03726]] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference(https://arxiv.org/abs/2508.03726)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
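The idea of verifying high-likelihood drafts first and pruning weak ones early can be sketched with a priority queue. This is a minimal illustration under stated assumptions, not the HVT algorithm itself, whose details the abstract does not give: `verify_prioritized`, the `verify` callback, the `prune_below` threshold, and the example scores are all hypothetical.

```python
import heapq


def verify_prioritized(drafts, verify, prune_below):
    """Check drafts in descending score order and stop early.

    drafts: list of (text, draft_log_likelihood) pairs (hypothetical format).
    verify: callable standing in for the expensive target-model check.
    prune_below: drafts scoring under this threshold are never verified.
    Returns (accepted_text_or_None, number_of_verify_calls).
    """
    # heapq is a min-heap, so negate scores to pop the best draft first.
    heap = [(-score, index) for index, (_, score) in enumerate(drafts)]
    heapq.heapify(heap)
    checked = 0
    while heap:
        neg_score, index = heapq.heappop(heap)
        if -neg_score < prune_below:
            break  # every remaining draft scores even lower, so prune them all
        checked += 1
        text = drafts[index][0]
        if verify(text):
            return text, checked
    return None, checked


# Placeholder drafts and a toy verifier that accepts only "a b".
candidates = [("a b", -2.0), ("a c", -0.5), ("a d", -5.0)]
accepted, calls = verify_prioritized(candidates, lambda t: t == "a b", -3.0)
print(accepted, calls)
```

In this toy run the verifier is called twice and the lowest-scoring draft is never checked, which is the kind of saving that prioritization with early pruning aims for.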

Title: WINELL: Wikipedia Never-Ending Updating with LLM Agents

Authors: Revanth Gangi Reddy, Tanay Dixit, Jiaxin Qin, Cheng Qian, Daniel Lee, Jiawei Han, Kevin Small, Xing Fan, Ruhi Sarikaya, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03728
Pdf URL: https://arxiv.org/pdf/2508.03728
Copy Paste: [[2508.03728]] WINELL: Wikipedia Never-Ending Updating with LLM Agents(https://arxiv.org/abs/2508.03728)
Keywords: gpt, llm, agent
Abstract: Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia's extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL's ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.

Title: GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

Authors: Ashutosh Bandooni, Brindha Subburaj
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03737
Pdf URL: https://arxiv.org/pdf/2508.03737
Copy Paste: [[2508.03737]] GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models(https://arxiv.org/abs/2508.03737)
Keywords: language model, gpt, chain-of-thought
Abstract: Benchmarks for evaluating the reasoning of Vision Language Models (VLMs) across fields and domains have been curated more frequently over the last few years. However, these are often monolingual, mostly available in English. There is also a lack of datasets available in Hindi for tasks other than comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations in India, the JEE Advanced and the CBSE Boards examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models on it, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a "Double Lock" constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in the Hindi language. We hope to facilitate the inclusion of languages like Hindi in research through our work.

Title: AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Authors: Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2508.03793
Pdf URL: https://arxiv.org/pdf/2508.03793
Copy Paste: [[2508.03793]] AttnTrace: Attention-based Context Traceback for Long-Context LLMs(https://arxiv.org/abs/2508.03793)
Keywords: language model, llm, long context, prompt, retrieval-augmented generation, agent
Abstract: Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at this https URL.

Title: Majority Bit-Aware Watermarking For Large Language Models

Authors: Jiahao Xu, Rui Hu, Zikai Zhang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2508.03829
Pdf URL: https://arxiv.org/pdf/2508.03829
Copy Paste: [[2508.03829]] Majority Bit-Aware Watermarking For Large Language Models(https://arxiv.org/abs/2508.03829)
Keywords: language model, llm
Abstract: The growing deployment of Large Language Models (LLMs) in real-world applications has raised concerns about their potential misuse in generating harmful or deceptive content. To address this issue, watermarking techniques have emerged as a promising solution by embedding identifiable binary messages into generated text for origin verification and misuse tracing. While recent efforts have explored multi-bit watermarking schemes capable of embedding rich information such as user identifiers, they typically suffer from the fundamental trade-off between text quality and decoding accuracy: to ensure reliable message decoding, they have to restrict the size of preferred token sets during encoding, yet such restrictions reduce the quality of the generated content. In this work, we propose MajorMark, a novel watermarking method that improves this trade-off through majority bit-aware encoding. MajorMark selects preferred token sets based on the majority bit of the message, enabling a larger and more flexible sampling of tokens. In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy, which maintains high decoding accuracy even when the preferred token set is large, thus preserving both content quality and decoding accuracy. We further introduce MajorMark$^+$, which partitions the message into multiple blocks to independently encode and deterministically decode each block, thereby further enhancing the quality of watermarked text and improving decoding accuracy. Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality, outperforming prior multi-bit watermarking baselines.

Title: Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

Authors: Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03860
Pdf URL: https://arxiv.org/pdf/2508.03860
Copy Paste: [[2508.03860]] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models(https://arxiv.org/abs/2508.03860)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.

Title: An Entity Linking Agent for Question Answering

Authors: Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03865
Pdf URL: https://arxiv.org/pdf/2508.03865
Copy Paste: [[2508.03865]] An Entity Linking Agent for Question Answering(https://arxiv.org/abs/2508.03865)
Keywords: language model, long context, agent
Abstract: Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

Title: Sotopia-RL: Reward Design for Social Intelligence

Authors: Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03905
Pdf URL: https://arxiv.org/pdf/2508.03905
Copy Paste: [[2508.03905]] Sotopia-RL: Reward Design for Social Intelligence(https://arxiv.org/abs/2508.03905)
Keywords: language model, llm, agent
Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: this https URL.

Title: CoAct-1: Computer-using Agents with Coding as Actions

Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03923
Pdf URL: https://arxiv.org/pdf/2508.03923
Copy Paste: [[2508.03923]] CoAct-1: Computer-using Agents with Coding as Actions(https://arxiv.org/abs/2508.03923)
Keywords: agent
Abstract: Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

Title: CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

Authors: Raymond Wilson, Cole Graham, Chase Carter, Zefeng Yang, Ruiqi Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03935
Pdf URL: https://arxiv.org/pdf/2508.03935
Copy Paste: [[2508.03935]] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation(https://arxiv.org/abs/2508.03935)
Keywords: language model, llm, hallucination
Abstract: In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM's generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM's ability to achieve a superior balance between personalization and factual accuracy in news headline generation.

Title: Data and AI governance: Promoting equity, ethics, and fairness in large language models

Authors: Alok Abhishek, Lisa Erickson, Tushar Bandopadhyay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03970
Pdf URL: https://arxiv.org/pdf/2508.03970
Copy Paste: [[2508.03970]] Data and AI governance: Promoting equity, ethics, and fairness in large language models(https://arxiv.org/abs/2508.03970)
Keywords: language model, llm
Abstract: In this paper, we cover approaches to systematically govern, assess, and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, we describe prevalent bias- and fairness-related gaps in Large Language Models (LLMs) and discuss a data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM-generated responses. By implementing data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to contribute to advancing the creation and deployment of socially responsible and ethically aligned generative-AI-powered applications.

Title: Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency

Authors: Md Arafat Sultan, Ramón Fernandez Astudillo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03979
Pdf URL: https://arxiv.org/pdf/2508.03979
Copy Paste: [[2508.03979]] Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency(https://arxiv.org/abs/2508.03979)
Keywords: llm, chain-of-thought
Abstract: Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model's own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases.
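Illustrative sketch (not the authors' algorithm): the abstract describes pruning hypotheses with a weighted set cover driven by model confidence and lexical coverage. A minimal greedy version is shown below. The function name `prune_hypotheses`, the representation of each hypothesis as a set of tokens, and the cost of `1 / confidence` are all assumptions made for illustration; the paper's actual weighting and pruning schedule are not given in the abstract.

```python
from typing import Dict, List, Set


def prune_hypotheses(token_sets: Dict[str, Set[str]],
                     confidence: Dict[str, float]) -> List[str]:
    """Greedy weighted set cover (hypothetical sketch).

    Keeps a subset of hypotheses whose tokens together cover every token
    that appears in any current hypothesis. Hypotheses not returned are
    the ones a pruning step would drop.
    """
    universe: Set[str] = set().union(*token_sets.values())
    covered: Set[str] = set()
    kept: List[str] = []

    while covered != universe:
        def gain(h: str) -> float:
            # Assumed cost is 1 / confidence, so "newly covered tokens per
            # unit cost" equals new_tokens * confidence.
            return len(token_sets[h] - covered) * confidence[h]

        candidates = [h for h in token_sets if h not in kept]
        best = max(candidates, key=gain)
        kept.append(best)
        covered |= token_sets[best]

    return kept


if __name__ == "__main__":
    sets = {
        "a": {"add", "two", "four"},
        "b": {"add", "two"},
        "c": {"four", "answer"},
    }
    conf = {"a": 0.9, "b": 0.8, "c": 0.6}
    print(prune_hypotheses(sets, conf))  # hypothesis "b" is redundant here
```

In this toy input, hypothesis "b" adds no tokens beyond those of "a", so it is the one left out. A real implementation would run this periodically during parallel decoding rather than once.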

Title: Are Today's LLMs Ready to Explain Well-Being Concepts?

Authors: Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.03990
Pdf URL: https://arxiv.org/pdf/2508.03990
Copy Paste: [[2508.03990]] Are Today's LLMs Ready to Explain Well-Being Concepts?(https://arxiv.org/abs/2508.03990)
Keywords: language model, llm
Abstract: Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

Title: Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

Authors: Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini, Dolores Albarracin, Tianlong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03998
Pdf URL: https://arxiv.org/pdf/2508.03998
Copy Paste: [[2508.03998]] Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models(https://arxiv.org/abs/2508.03998)
Keywords: agent
Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but "black box" foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot's reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert's cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

Title: HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

Authors: Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li, Zhuosheng Zhang, Shengyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04010
Pdf URL: https://arxiv.org/pdf/2508.04010
Copy Paste: [[2508.04010]] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization(https://arxiv.org/abs/2508.04010)
Keywords: language model, agent
Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: this https URL.

Title: Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

Authors: Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04012
Pdf URL: https://arxiv.org/pdf/2508.04012
Copy Paste: [[2508.04012]] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing(https://arxiv.org/abs/2508.04012)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.

Title: ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Authors: Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.04038
Pdf URL: https://arxiv.org/pdf/2508.04038
Copy Paste: [[2508.04038]] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents(https://arxiv.org/abs/2508.04038)
Keywords: language model, llm, agent
Abstract: Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at this https URL.

Title: Large Reasoning Models Are Autonomous Jailbreak Agents

Authors: Thilo Hagendorff, Erik Derner, Nuria Oliver
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.04039
Pdf URL: https://arxiv.org/pdf/2508.04039
Copy Paste: [[2508.04039]] Large Reasoning Models Are Autonomous Jailbreak Agents(https://arxiv.org/abs/2508.04039)
Keywords: prompt, agent
Abstract: Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.

Title: DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation

Authors: Jiabing Yang, Yixiang Chen, Zichen Wen, Chenhang Cui, Peiyan Li, Yuan Xu, Bowen Fang, Yan Huang, Liang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04047
Pdf URL: https://arxiv.org/pdf/2508.04047
Copy Paste: [[2508.04047]] DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation(https://arxiv.org/abs/2508.04047)
Keywords: prompt
Abstract: Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA's superior effectiveness in long text generation.
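Illustrative sketch (not the authors' formula): the abstract says attention to the prefix is amplified by a scaling factor that grows exponentially with sequence length. One simple way to picture this is to multiply the attention weights on prefix positions by `base ** step` and renormalize. The function name, the `base` value of 1.05, and the renormalization step are assumptions; the abstract does not specify how the factor is applied.

```python
from typing import List


def amplify_prefix_attention(weights: List[float], prefix_len: int,
                             step: int, base: float = 1.05) -> List[float]:
    """Scale attention on the first `prefix_len` positions by base**step,
    then renormalize so the weights sum to 1 again (hypothetical sketch)."""
    scale = base ** step  # exponential growth in the decoding step
    scaled = [w * scale if i < prefix_len else w
              for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


if __name__ == "__main__":
    attn = [0.1, 0.1, 0.4, 0.4]  # first two positions are the prefix
    for t in (0, 20, 40):
        boosted = amplify_prefix_attention(attn, prefix_len=2, step=t)
        print(t, round(sum(boosted[:2]), 3))  # prefix attention mass
```

With `step=0` the weights are unchanged; as `step` grows, the share of attention on the prefix rises, which is the stated countermeasure to attention decay over long sequences.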

Title: PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04057
Pdf URL: https://arxiv.org/pdf/2508.04057
Copy Paste: [[2508.04057]] PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG(https://arxiv.org/abs/2508.04057)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become a cornerstone technique for enhancing large language models (LLMs) with external knowledge. However, current RAG systems face two critical limitations: (1) they inefficiently retrieve information for every query, including simple questions that could be resolved using the LLM's parametric knowledge alone, and (2) they risk retrieving irrelevant documents when queries contain sparse information signals. To address these gaps, we introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free framework that integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. Specifically, PAIRS employs a dual-path generation mechanism: First, the LLM produces both a direct answer and a context-augmented answer using self-generated pseudo-context. When these outputs converge, PAIRS bypasses external retrieval entirely, dramatically improving the RAG system's efficiency. For divergent cases, PAIRS activates a dual-path retrieval (DPR) process guided by both the original query and self-generated contextual signals, followed by an Adaptive Information Selection (AIS) module that filters documents through weighted similarity to both sources. This simple yet effective approach can not only enhance efficiency by eliminating unnecessary retrievals but also improve accuracy through contextually guided retrieval and adaptive information selection. Experimental results on six question-answering (QA) benchmarks show that PAIRS reduces retrieval costs by around 25% (triggering for only 75% of queries) while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior baselines on average.
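Illustrative sketch (not the authors' implementation): the two decisions described in the abstract, whether to retrieve and how to select documents, can be pictured as below. The names `needs_retrieval` and `select_documents`, the exact-match convergence test, the cosine similarity, and the mixing weight of 0.5 are all assumptions; the abstract only states that outputs are compared and that documents are filtered by weighted similarity to the query and the self-generated context.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def needs_retrieval(direct_answer: str, context_answer: str) -> bool:
    """Assumed convergence test: retrieve only if the two answers differ."""
    return direct_answer.strip().lower() != context_answer.strip().lower()


def select_documents(docs: List[Tuple[str, Vector]], query_vec: Vector,
                     context_vec: Vector, weight: float = 0.5,
                     top_k: int = 2) -> List[str]:
    """Rank documents by a weighted mix of similarity to the query and to
    the self-generated pseudo-context, and keep the top_k names."""
    scored = [
        (weight * cosine(vec, query_vec)
         + (1.0 - weight) * cosine(vec, context_vec), name)
        for name, vec in docs
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    docs = [("d1", [1.0, 0.0]), ("d2", [0.0, 1.0]), ("d3", [1.0, 1.0])]
    if needs_retrieval("Paris", "Lyon"):
        print(select_documents(docs, [1.0, 0.0], [0.0, 1.0], top_k=1))
```

In the toy vectors, "d3" is moderately similar to both the query and the pseudo-context, so it outranks documents that match only one of the two signals.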

Title: Efficient Strategy for Improving Large Language Model (LLM) Capabilities

Authors: Julián Camilo Velandia Gutiérrez
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04073
Pdf URL: https://arxiv.org/pdf/2508.04073
Copy Paste: [[2508.04073]] Efficient Strategy for Improving Large Language Model (LLM) Capabilities(https://arxiv.org/abs/2508.04073)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model to explore and combine data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master's thesis in Systems and Computer Engineering titled "Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)".

Title: ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

Authors: Zhongyi Zhou, Kohei Uehara, Haoyu Zhang, Jingtao Zhou, Lin Gu, Ruofei Du, Zheng Xu, Tatsuya Harada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04086
Pdf URL: https://arxiv.org/pdf/2508.04086
Copy Paste: [[2508.04086]] ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"(https://arxiv.org/abs/2508.04086)
Keywords: llm, agent
Abstract: Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like DFS. This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-5k, a dataset generated with more complex tool use, lower cost, and 100% pass rate. Experiments show that models trained on ToolGrad-5k outperform those on expensive baseline datasets and proprietary LLMs, even on OOD benchmarks.

Title: GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Authors: Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04088
Pdf URL: https://arxiv.org/pdf/2508.04088
Copy Paste: [[2508.04088]] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning(https://arxiv.org/abs/2508.04088)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM's generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.

Title: Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

Authors: Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04117
Pdf URL: https://arxiv.org/pdf/2508.04117
Copy Paste: [[2508.04117]] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks(https://arxiv.org/abs/2508.04117)
Keywords: language model, llm
Abstract: The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal the uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments unveil the over-memorization to be broadly applicable across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.

Title: Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Authors: Xuan Qi, Rongwu Xu, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04149
Pdf URL: https://arxiv.org/pdf/2508.04149
Copy Paste: [[2508.04149]] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap(https://arxiv.org/abs/2508.04149)
Keywords: language model, llm
Abstract: Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. Existing work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
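Illustrative sketch (not the authors' code): in DPO, the implicit reward of a response is beta times the difference between its log-probability under the policy and under the reference model. The reward gap of a preference pair is the chosen response's reward minus the rejected response's reward. The sketch below keeps the fraction of pairs with the smallest gap. The dictionary keys, the function names, the `beta` value of 0.1, and the use of the signed gap (rather than its absolute value) are assumptions not stated in the abstract.

```python
from typing import Dict, List


def implicit_reward(logp_policy: float, logp_ref: float,
                    beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def select_hardest(pairs: List[Dict[str, float]], fraction: float = 0.1,
                   beta: float = 0.1) -> List[Dict[str, float]]:
    """Keep the `fraction` of preference pairs with the smallest reward gap
    (hypothetical sketch; small gaps are treated as harder examples)."""
    def gap(pair: Dict[str, float]) -> float:
        chosen = implicit_reward(pair["policy_chosen"],
                                 pair["ref_chosen"], beta)
        rejected = implicit_reward(pair["policy_rejected"],
                                   pair["ref_rejected"], beta)
        return chosen - rejected

    k = max(1, int(len(pairs) * fraction))
    return sorted(pairs, key=gap)[:k]


if __name__ == "__main__":
    # Ten toy pairs whose gap grows with the id (sequence log-probabilities).
    data = [
        {"id": i, "policy_chosen": -1.0 + i, "ref_chosen": -1.0,
         "policy_rejected": -2.0, "ref_rejected": -2.0}
        for i in reversed(range(10))
    ]
    print([p["id"] for p in select_hardest(data, fraction=0.1)])
```

The log-probabilities here are placeholders; in practice they would be summed token log-probabilities from a policy and a reference model scored on each response.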

Title: Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

Authors: Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04182
Pdf URL: https://arxiv.org/pdf/2508.04182
Copy Paste: [[2508.04182]] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity(https://arxiv.org/abs/2508.04182)
Keywords: language model, llm, hallucination
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations--generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token's standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.

Title: Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Authors: Siddhant Panpatil, Hiskias Dingeto, Haon Park
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.04196
Pdf URL: https://arxiv.org/pdf/2508.04196
Copy Paste: [[2508.04196]] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models(https://arxiv.org/abs/2508.04196)
Keywords: language model, gpt, llm
Abstract: Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

Title: Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04199
Pdf URL: https://arxiv.org/pdf/2508.04199
Copy Paste: [[2508.04199]] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts(https://arxiv.org/abs/2508.04199)
Keywords: language model, llm
Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.

Title: Hierarchical Text Classification Using Black Box Large Language Models

Authors: Kosuke Yoshimura, Hisashi Kashima
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04219
Pdf URL: https://arxiv.org/pdf/2508.04219
Copy Paste: [[2508.04219]] Hierarchical Text Classification Using Black Box Large Language Models(https://arxiv.org/abs/2508.04219)
Keywords: language model, llm, prompt
Abstract: Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially with the DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly under the DH strategy because deeper label hierarchies require more input tokens. These results emphasize the trade-off between accuracy improvement and the computational cost of the prompting strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompting strategy to balance performance and cost.
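The three prompting strategies named in the abstract above (DL, DH, TMH) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the toy `HIERARCHY`, the prompt wording, and the `ask` callable that stands in for a black box LLM API call. None of it is taken from the paper.

```python
# Hypothetical two-level label hierarchy (not from the paper).
HIERARCHY = {
    "Science": {"Physics": {}, "Biology": {}},
    "Sports": {"Tennis": {}, "Soccer": {}},
}

def leaf_paths(tree, prefix=()):
    """Return every root-to-leaf label path as a tuple of labels."""
    paths = []
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths

def dl_prompt(text, tree):
    """Direct Leaf Label Prediction: list only the leaf labels."""
    leaves = [p[-1] for p in leaf_paths(tree)]
    return f"Text: {text}\nChoose one label: {', '.join(leaves)}"

def dh_prompt(text, tree):
    """Direct Hierarchical Label Prediction: list full label paths."""
    paths = [" > ".join(p) for p in leaf_paths(tree)]
    return f"Text: {text}\nChoose one label path: {'; '.join(paths)}"

def tmh_classify(text, tree, ask):
    """Top-down Multi-step prediction: one question per hierarchy level.

    `ask(prompt) -> str` is a hypothetical stand-in for the LLM call.
    """
    path, node = [], tree
    while node:
        prompt = f"Text: {text}\nChoose one label: {', '.join(node)}"
        choice = ask(prompt)
        if choice not in node:
            raise ValueError(f"unexpected label: {choice!r}")
        path.append(choice)
        node = node[choice]
    return path
```

In this sketch the DH prompt spells out every full path, so it grows faster than the DL prompt as the hierarchy deepens, which is consistent with the cost trade-off the abstract describes. TMH keeps each prompt short but needs one call per level.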

Title: DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

Authors: Chanjuan Liu (1), Shengzhi Wang (2), Enqiang Zhu (2) ((1) School of Computer Science and Technology, Dalian University of Technology, Dalian, China; (2) Institute of Computing Technology, Guangzhou University, Guangzhou, China)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04239
Pdf URL: https://arxiv.org/pdf/2508.04239
Copy Paste: [[2508.04239]] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting(https://arxiv.org/abs/2508.04239)
Keywords: language model, gpt, prompt
Abstract: Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models show promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. The tokenizer generates the explicit prompt while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.

Title: TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

Authors: Xi Wang, Anxo Perez, Javier Parapar, Fabio Crestani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04248
Pdf URL: https://arxiv.org/pdf/2508.04248
Copy Paste: [[2508.04248]] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening(https://arxiv.org/abs/2508.04248)
Keywords: language model, llm
Abstract: The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

Title: KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Authors: Zunhai Su, Kehong Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04257
Pdf URL: https://arxiv.org/pdf/2508.04257
Copy Paste: [[2508.04257]] KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs(https://arxiv.org/abs/2508.04257)
Keywords: language model, llm
Abstract: Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language model (LLM) inference, reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to ensure the protection of attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization. Based on our enhanced understanding, we introduce KVSink, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation. Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization. Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.
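The Preserve-First-N (PFN) baseline mentioned in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the cache is modelled as a plain list of per-token vectors, the quantizer is a simple per-vector min-max scheme, and all function names are hypothetical. How KVSink predicts sink positions is not described in the abstract, so the sketch only accepts an arbitrary set of positions to keep.

```python
def quantize_vector(vec, bits=4):
    """Uniform min-max quantization of one vector, returned dequantized."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]

def quantize_kv_cache(cache, keep_positions, bits=4):
    """Quantize every token's vector except those at `keep_positions`.

    `cache` is a list of per-token vectors (an assumed toy layout);
    `keep_positions` is the set of token indices left in full precision.
    """
    return [
        list(vec) if i in keep_positions else quantize_vector(vec, bits)
        for i, vec in enumerate(cache)
    ]

def preserve_first_n(cache, n, bits=4):
    """PFN baseline: keep the first n tokens in full precision."""
    return quantize_kv_cache(cache, set(range(n)), bits)
```

Passing a different set to `quantize_kv_cache`, for example `{0, 17}`, would model the case the abstract raises, where sinks appear beyond the initial token positions.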

Title: ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Authors: Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jiandong Zhang, Xiaoyi Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04266
Pdf URL: https://arxiv.org/pdf/2508.04266
Copy Paste: [[2508.04266]] ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents(https://arxiv.org/abs/2508.04266)
Keywords: gpt, llm, agent
Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

Title: A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Authors: Jiayi Wen, Tianxin Chen, Zhirun Zheng, Cheng Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04276
Pdf URL: https://arxiv.org/pdf/2508.04276
Copy Paste: [[2508.04276]] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models(https://arxiv.org/abs/2508.04276)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

Title: Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Authors: Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04325
Pdf URL: https://arxiv.org/pdf/2508.04325
Copy Paste: [[2508.04325]] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models(https://arxiv.org/abs/2508.04325)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

Title: Modelling and Classifying the Components of a Literature Review

Authors: Francisco Bolaños, Angelo Salatino, Francesco Osborne, Enrico Motta
Subjects: cs.CL, cs.AI, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04337
Pdf URL: https://arxiv.org/pdf/2508.04337
Copy Paste: [[2508.04337]] Modelling and Classifying the Components of a Literature Review(https://arxiv.org/abs/2508.04337)
Keywords: language model, gpt, llm
Abstract: Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

Title: GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Authors: Hongze Tan, Jianfei Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04349
Pdf URL: https://arxiv.org/pdf/2508.04349
Copy Paste: [[2508.04349]] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy(https://arxiv.org/abs/2508.04349)
Keywords: language model, llm
Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses it with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token for fine-grained credit assignment; and 2) Sequence-Level Group Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
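The idea of entropy-weighted rewards described in the abstract above can be made concrete with a small Python sketch. The abstract does not give the weighting formulas, so both formulas below are assumptions chosen for illustration: the token-level variant splits a sequence reward in proportion to each token's entropy, and the sequence-level variant scales the reward by the mean token entropy. Function names and the `scale` parameter are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def token_level_rewards(seq_reward, token_dists):
    """Token-level sketch: split a sequence reward by token entropy.

    Assumed weighting (not from the paper): r_t = R * T * H_t / sum(H),
    so the mean token reward stays equal to R. Falls back to a uniform
    reward when every token has zero entropy.
    """
    entropies = [token_entropy(d) for d in token_dists]
    total = sum(entropies)
    n = len(entropies)
    if total == 0.0:
        return [seq_reward] * n
    return [seq_reward * n * h / total for h in entropies]

def sequence_level_reward(seq_reward, token_dists, scale=1.0):
    """Sequence-level sketch: scale the reward by mean token entropy.

    Assumed form (not from the paper): R * (1 + scale * mean(H)).
    """
    entropies = [token_entropy(d) for d in token_dists]
    mean_h = sum(entropies) / len(entropies)
    return seq_reward * (1.0 + scale * mean_h)
```

With a uniform reward every token in a sequence receives the same credit. In this sketch a confidently predicted token (entropy near zero) receives almost none, while tokens where the policy was uncertain receive most of it.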

Title: Chain of Questions: Guiding Multimodal Curiosity in Language Models

Authors: Nima Iji, Kia Dashtipour
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2508.04350
Pdf URL: https://arxiv.org/pdf/2508.04350
Copy Paste: [[2508.04350]] Chain of Questions: Guiding Multimodal Curiosity in Language Models(https://arxiv.org/abs/2508.04350)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model's ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

Title: AIC CTU@FEVER 8: On-premise fact checking through long context RAG

Authors: Herbert Ullrich, Jan Drchal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04390
Pdf URL: https://arxiv.org/pdf/2508.04390
Copy Paste: [[2508.04390]] AIC CTU@FEVER 8: On-premise fact checking through long context RAG(https://arxiv.org/abs/2508.04390)
Keywords: long context
Abstract: In this paper, we present our fact-checking pipeline, which scored first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our last year's submission. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in the sense of Ev2R test score), even under the constraint of a single NVIDIA A10 GPU, 23GB of graphical memory and 60s running time per claim.

Title: Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

Authors: Xu Zhang, Mei Chen
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04399
Pdf URL: https://arxiv.org/pdf/2508.04399
Copy Paste: [[2508.04399]] Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky(https://arxiv.org/abs/2508.04399)
Keywords: language model, llm
Abstract: This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as a baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1: 0.66). LLMs excelled in recall for some variants (e.g., Gemma3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.

Title: Why are LLMs' abilities emergent?

Authors: Vladimír Havlík
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04401
Pdf URL: https://arxiv.org/pdf/2508.04401
Copy Paste: [[2508.04401]] Why are LLMs' abilities emergent?(https://arxiv.org/abs/2508.04401)
Keywords: language model, llm
Abstract: The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of "creation without understanding" that characterises contemporary AI development. I explore how the neural approach's reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

Title: Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

Authors: Kiyotada Mori, Seiya Kawano, Angel Fernando Garcia Contreras, Koichiro Yoshino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04403
Pdf URL: https://arxiv.org/pdf/2508.04403
Copy Paste: [[2508.04403]] Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model(https://arxiv.org/abs/2508.04403)
Keywords: language model
Abstract: Prefetching of dialogue responses has been investigated to reduce user-perceived latency (UPL), which refers to the user's waiting time before receiving the system's response, in spoken dialogue systems. To reduce the UPL, it is necessary to predict complete user utterances before the end of the user's speech, typically by language models, to prepare prefetched dialogue responses. In this study, we proposed a prediction confidence model (PCM) that determines whether prefetching is possible or not by estimating the semantic similarity between the predicted complete user utterance and the complete user utterance. We evaluated our PCM based on the differences between the predicted complete user utterance and the complete user utterance.
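As a rough illustration of the prefetching decision described in the abstract above, the sketch below scores a predicted complete utterance against the actual one and prefetches only when the score clears a threshold. The bag-of-words cosine similarity, the function names, and the 0.8 threshold are all illustrative assumptions; the abstract does not specify the similarity measure, and at run time the PCM would have to estimate this score before the true utterance is known.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two utterances."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def should_prefetch(estimated_similarity: float, threshold: float = 0.8) -> bool:
    """Prefetch only when the estimated similarity clears the (assumed) threshold."""
    return estimated_similarity >= threshold


# Hypothetical utterances: the prediction differs from the truth in one word.
predicted = "book a table for two at seven"
actual = "book a table for two at eight"
sim = cosine_similarity(predicted, actual)
print(round(sim, 3), should_prefetch(sim))
```

A stricter threshold lowers the risk of serving a mismatched prefetched response, at the cost of fewer latency savings.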

Title: Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

Authors: Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04423
Pdf URL: https://arxiv.org/pdf/2508.04423
Copy Paste: [[2508.04423]] Evaluating, Synthesizing, and Enhancing for Customer Support Conversation(https://arxiv.org/abs/2508.04423)
Keywords: llm, agent
Abstract: Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at this https URL.

Title: StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

Authors: Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04440
Pdf URL: https://arxiv.org/pdf/2508.04440
Copy Paste: [[2508.04440]] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion(https://arxiv.org/abs/2508.04440)
Keywords: llm
Abstract: Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

Title: Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

Authors: Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04442
Pdf URL: https://arxiv.org/pdf/2508.04442
Copy Paste: [[2508.04442]] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI(https://arxiv.org/abs/2508.04442)
Keywords: gpt, prompt, retrieval-augmented generation
Abstract: This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.

Title: CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

Authors: Bastien Liétard, Gabriel Loiseau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04494
Pdf URL: https://arxiv.org/pdf/2508.04494
Copy Paste: [[2508.04494]] CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation(https://arxiv.org/abs/2508.04494)
Keywords: language model
Abstract: Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. To investigate them, Contextualized Language Models are a valuable tool that provides context-sensitive representations that can be used to investigate lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them to get more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-word scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that achieve the best performance in our experiments. We also show that CALE's fine-tuning brings valuable changes to the spatial organization of embeddings.

Title: StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

Authors: Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04530
Pdf URL: https://arxiv.org/pdf/2508.04530
Copy Paste: [[2508.04530]] StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering(https://arxiv.org/abs/2508.04530)
Keywords: language model, llm
Abstract: Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model's core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model's representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
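To make the idea of separating style from truth directions concrete, here is a minimal sketch that removes from a style vector its component along a truth vector, so that steering along the result leaves the truth direction untouched. This single-vector projection is a simplification chosen for illustration; the abstract describes an orthogonal deflation over subspaces, and the vectors and function names below are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_component(v, direction):
    """Return v with its component along `direction` projected out."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(v, direction)]


# Hypothetical 3-d directions; real ones would come from model activations.
truth_dir = [1.0, 0.0, 0.0]
style_dir = [0.6, 0.8, 0.0]

style_only = remove_component(style_dir, truth_dir)
# style_only is orthogonal to truth_dir, so adding it to a hidden state
# does not move that state along the truth direction.
print(style_only, dot(style_only, truth_dir))
```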

Title: Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

Authors: Zhuang Chen, Guanqun Bi, Wen Zhang, Jiawei Hu, Aoyun Wang, Xiyao Xiao, Kun Feng, Minlie Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04531
Pdf URL: https://arxiv.org/pdf/2508.04531
Copy Paste: [[2508.04531]] Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning(https://arxiv.org/abs/2508.04531)
Keywords: llm
Abstract: Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose to guide the reasoning process with clinical expertise and consistently improve LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.

Title: Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

Authors: Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.04575
Pdf URL: https://arxiv.org/pdf/2508.04575
Copy Paste: [[2508.04575]] Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration(https://arxiv.org/abs/2508.04575)
Keywords: agent
Abstract: While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

Title: Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04581
Pdf URL: https://arxiv.org/pdf/2508.04581
Copy Paste: [[2508.04581]] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning(https://arxiv.org/abs/2508.04581)
Keywords: language model, llm
Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
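The weight-sharing idea in the abstract above, where each layer's matrix is a linear combination of shared matrix atoms, can be sketched as follows. The layer count, matrix size, and the choice of one atom per three layers are assumptions picked only to show how a reduction close to two thirds could arise; the abstract does not state the dictionary size, and `combine` and `param_counts` are hypothetical helper names.

```python
def combine(atoms, coeffs):
    """Build one layer's matrix as a linear combination of shared atoms."""
    rows, cols = len(atoms[0]), len(atoms[0][0])
    return [
        [sum(c * atom[i][j] for c, atom in zip(coeffs, atoms)) for j in range(cols)]
        for i in range(rows)
    ]


def param_counts(num_layers, dim, num_atoms):
    """Parameters for per-layer dim x dim matrices vs. a shared dictionary."""
    independent = num_layers * dim * dim
    shared = num_atoms * dim * dim + num_layers * num_atoms  # atoms + coefficients
    return independent, shared


# Two hypothetical 2x2 atoms shared by every layer.
atoms = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
layer_weights = combine(atoms, coeffs=[0.5, 2.0])

# Assumed configuration: 12 layers, 64x64 matrices, 4 shared atoms.
independent, shared = param_counts(num_layers=12, dim=64, num_atoms=4)
reduction = 1 - shared / independent
print(layer_weights, f"{reduction:.1%}")
```

Under these assumptions the per-layer coefficients add only a handful of parameters, so the saving is dominated by the ratio of atoms to layers.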

Title: TURA: Tool-Augmented Unified Retrieval Agent for AI Search

Authors: Zhejun Zhao, Yuehu Dong, Alley Liu, Lixue Zheng, Pingsheng Liu, Dongdong Shen, Long Xia, Jiashu Zhao, Dawei Yin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04604
Pdf URL: https://arxiv.org/pdf/2508.04604
Copy Paste: [[2508.04604]] TURA: Tool-Augmented Unified Retrieval Agent for AI Search(https://arxiv.org/abs/2508.04604)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
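The DAG-based planning step mentioned in the abstract above can be illustrated with a small scheduler that groups sub-tasks into stages, where every task in a stage has all of its dependencies satisfied and could therefore run in parallel. This is a generic topological layering, not TURA's actual planner; the task names and the `parallel_stages` helper are hypothetical.

```python
def parallel_stages(dependencies):
    """Group tasks into stages; tasks in a stage have no unmet dependencies.

    `dependencies` maps each task to the set of tasks it must wait for.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    stages = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        stages.append(ready)
        for task in ready:
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)
    return stages


# Hypothetical sub-tasks for a travel query.
plan = {
    "find_flights": set(),
    "find_hotels": set(),
    "check_weather": set(),
    "build_itinerary": {"find_flights", "find_hotels"},
    "final_answer": {"build_itinerary", "check_weather"},
}
print(parallel_stages(plan))
```

Each inner list could be dispatched concurrently, for example to separate tool servers, before the next stage starts.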

Title: Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

Authors: Chirag Seth, Utkarsh Singh
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04623
Pdf URL: https://arxiv.org/pdf/2508.04623
Copy Paste: [[2508.04623]] Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider(https://arxiv.org/abs/2508.04623)
Keywords: gpt
Abstract: Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on low-resource settings. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model's architecture, training them across 1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2 (20.1%), highlighting encoder-decoder models' superiority in schema-aware SQL generation. Despite resource constraints limiting performance, our pipeline's modularity supports future enhancements, such as advanced schema linking or alternative base models. This work underscores the potential of compact transformers for accessible text-to-SQL solutions in resource-scarce environments.
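For readers unfamiliar with the Exact Match metric named in the abstract above, a simplified string-level version might look like the sketch below. The normalization rules (lowercasing, trimming a trailing semicolon, collapsing whitespace) are assumptions made for illustration, and the example queries are invented; benchmark evaluation scripts may compare parsed SQL components rather than raw strings, so the paper's numbers need not match this definition.

```python
def normalize_sql(query: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(query.strip().rstrip(";").lower().split())


def exact_match(predictions, references) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(
        normalize_sql(p) == normalize_sql(r) for p, r in zip(predictions, references)
    )
    return hits / len(references) if references else 0.0


# Hypothetical model outputs and gold queries.
preds = ["SELECT name FROM singer;", "select count(*) from  concert", "SELECT * FROM t"]
golds = ["select name from singer", "SELECT COUNT(*) FROM concert", "SELECT id FROM t"]
print(exact_match(preds, golds))
```

Note that lowercasing also alters string literals inside a query, which is one reason a component-level comparison is usually preferable.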

Title: P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

Authors: Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04626
Pdf URL: https://arxiv.org/pdf/2508.04626
Copy Paste: [[2508.04626]] P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis(https://arxiv.org/abs/2508.04626)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact way is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead.

Title: IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

Authors: Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04632
Pdf URL: https://arxiv.org/pdf/2508.04632
Copy Paste: [[2508.04632]] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards(https://arxiv.org/abs/2508.04632)
Keywords: language model, gpt, llm
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction-following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

Title: Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Authors: Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts, Omar Khattab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04660
Pdf URL: https://arxiv.org/pdf/2508.04660
Copy Paste: [[2508.04660]] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs(https://arxiv.org/abs/2508.04660)
Keywords: language model, prompt
Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the this http URL optimizer.
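As a rough illustration of the grouping idea in the mmGRPO abstract above, the following Python sketch pools LM calls by module across rollouts and computes a group-relative advantage for each call. Everything concrete here is an assumption made for illustration: the `LMCall` record, the `module_advantages` function, the choice to credit each call with its rollout's final reward, and the mean/standard-deviation normalization borrowed from standard GRPO. It is not the DSPy implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class LMCall:
    rollout_id: int  # which rollout produced this call
    module: str      # hypothetical module name, e.g. "search" or "answer"
    reward: float    # assumed: rollout-level reward credited to this call


def module_advantages(calls, eps=1e-8):
    """Group calls by module across rollouts, then normalize rewards per group.

    Returns {module: [(rollout_id, advantage), ...]} in input order.
    """
    groups = defaultdict(list)
    for call in calls:
        groups[call.module].append(call)

    advantages = {}
    for module, group in groups.items():
        rewards = [c.reward for c in group]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; 0.0 for a single-call group
        advantages[module] = [
            (c.rollout_id, (c.reward - mu) / (sigma + eps)) for c in group
        ]
    return advantages


# Rollout 1 is interrupted before reaching "answer", so it only
# contributes to the "search" group.
calls = [
    LMCall(0, "search", 1.0),
    LMCall(0, "answer", 1.0),
    LMCall(1, "search", 0.0),
]
adv = module_advantages(calls)
```

Because groups are keyed by module rather than by position in the trajectory, rollouts of different lengths, or ones that stop early, simply contribute fewer calls to some groups.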

Title: Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Authors: Mo Li, L.H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04664
Pdf URL: https://arxiv.org/pdf/2508.04664
Copy Paste: [[2508.04664]] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management(https://arxiv.org/abs/2508.04664)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on two information-sparse benchmarks, PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs' inherent ability to generalize tool calling. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.

Title: GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

Authors: Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04676
Pdf URL: https://arxiv.org/pdf/2508.04676
Copy Paste: [[2508.04676]] GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay(https://arxiv.org/abs/2508.04676)
Keywords: language model, llm
Abstract: The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state-constrained optimization method using threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns: retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs for the future. Our code and data are available at this https URL.

Title: FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data

Authors: Thibaut Thonet, Germán Kruszewski, Jos Rozen, Pierre Erbacher, Marc Dymetman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04698
Pdf URL: https://arxiv.org/pdf/2508.04698
Copy Paste: [[2508.04698]] FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data(https://arxiv.org/abs/2508.04698)
Keywords: llm
Abstract: LLM-powered conversational assistants are often deployed in a one-size-fits-all manner, which fails to accommodate individual user preferences. Recently, LLM personalization -- tailoring models to align with specific user preferences -- has gained increasing attention as a way to bridge this gap. In this work, we specifically focus on a practical yet challenging setting where only a small set of preference annotations can be collected per user -- a problem we define as Personalized Preference Alignment with Limited Data (PPALLI). To support research in this area, we introduce two datasets -- DnD and ELIP -- and benchmark a variety of alignment techniques on them. We further propose FaST, a highly parameter-efficient approach that leverages high-level features automatically discovered from the data, achieving the best overall performance.

Title: Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Authors: Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04699
Pdf URL: https://arxiv.org/pdf/2508.04699
Copy Paste: [[2508.04699]] Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis(https://arxiv.org/abs/2508.04699)
Keywords: language model, chat
Abstract: The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex, multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.