2025-07-23

Title: eSapiens's DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs

Authors: Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15863
Pdf URL: https://arxiv.org/pdf/2507.15863
Copy Paste: [[2507.15863]] eSapiens's DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs(https://arxiv.org/abs/2507.15863)
Keywords: gpt, llm, prompt, retrieval-augmented generation
Abstract: We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers, enforce end-to-end TLS 1.3 and AES-256. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
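The overlapping-chunk step described in the abstract can be illustrated with a minimal sketch. The 1,000-token chunk size comes from the abstract; the overlap of 200 tokens, the function name `chunk_tokens`, and the use of a pre-tokenized list are assumptions made for illustration and are not taken from the paper.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split a token list into fixed-size chunks that overlap by `overlap` tokens.

    chunk_size=1000 follows the abstract; overlap=200 is an assumed value.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Toy document of 2,500 integer "tokens".
chunks = chunk_tokens(list(range(2500)))
```

Each chunk repeats the tail of the previous one, so a passage that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.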

Title: Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models

Authors: Altynbek Ismailov, Salia Asanova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15868
Pdf URL: https://arxiv.org/pdf/2507.15868
Copy Paste: [[2507.15868]] Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models(https://arxiv.org/abs/2507.15868)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier ("max" to "min"); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three "reasoning-tuned" versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
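Perturbation (i), progressive underspecification, can be sketched as follows. The abstract states only that 10% of words are deleted per step; deleting words in a fixed random order, measuring the 10% against the original word count, and the function name `progressive_deletion` are assumptions made for this illustration.

```python
import random


def progressive_deletion(prompt, steps=9, fraction=0.10, seed=0):
    """Return `steps` variants of `prompt`; variant k has k * fraction of the
    ORIGINAL words removed (10%, 20%, ... up to 90% by default)."""
    rng = random.Random(seed)
    words = prompt.split()
    order = list(range(len(words)))
    rng.shuffle(order)  # fixed random deletion order (an assumption)
    variants = []
    for step in range(1, steps + 1):
        n_delete = round(len(words) * fraction * step)
        removed = set(order[:n_delete])
        kept = [w for i, w in enumerate(words) if i not in removed]
        variants.append(" ".join(kept))
    return variants


# Toy 20-word prompt: w0 w1 ... w19.
toy_prompt = " ".join(f"w{i}" for i in range(20))
variants = progressive_deletion(toy_prompt)
```

Because every step deletes a prefix of the same shuffled order, each variant is a strict subset of the previous one, which keeps the degradation monotone from step to step.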

Title: Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation

Authors: Sumit Singh, Rohit Mishra, Uma Shanker Tiwary
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16002
Pdf URL: https://arxiv.org/pdf/2507.16002
Copy Paste: [[2507.16002]] Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation(https://arxiv.org/abs/2507.16002)
Keywords: language model, gpt, chat
Abstract: One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA. However, Llama2-70B, Llama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that don't incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms Llama2-7B by a significant margin. On the other hand, the generative models which are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and Llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.

Title: Learning without training: The implicit dynamics of in-context learning

Authors: Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16003
Pdf URL: https://arxiv.org/pdf/2507.16003
Copy Paste: [[2507.16003]] Learning without training: The implicit dynamics of in-context learning(https://arxiv.org/abs/2507.16003)
Keywords: language model, llm, prompt
Abstract: One of the most striking features of Large Language Models (LLMs) is their ability to learn in context. Namely, at inference time, an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight-update of the MLP layer.
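The idea that context can act like a low-rank weight update can be checked numerically in a toy setting. The abstract does not give the formula, so the rank-1 form used here, dW = W (a_ctx - a_plain) a_plain^T / ||a_plain||^2, is an assumption chosen for illustration, as are the names `a_plain` (attention output for the query alone) and `a_ctx` (attention output for the query with context). The sketch covers only the first linear layer of the MLP and ignores skip connections.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def implicit_update(W, a_plain, a_ctx):
    """Rank-1 update dW such that (W + dW) a_plain == W a_ctx."""
    delta = [c - p for c, p in zip(a_ctx, a_plain)]
    w_delta = matvec(W, delta)              # W (a_ctx - a_plain)
    norm_sq = sum(p * p for p in a_plain)   # ||a_plain||^2
    # Outer product of w_delta with a_plain, scaled by 1 / ||a_plain||^2.
    return [[wd * p / norm_sq for p in a_plain] for wd in w_delta]


# Toy values, not taken from the paper.
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
a_plain = [1.0, 2.0]
a_ctx = [2.0, 0.5]

dW = implicit_update(W, a_plain, a_ctx)
W_new = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, dW)]

with_context = matvec(W, a_ctx)            # original weights, context-aware input
without_context = matvec(W_new, a_plain)   # updated weights, context-free input
```

The two outputs agree because (W + dW) a_plain = W a_plain + W (a_ctx - a_plain) = W a_ctx, so the effect of the context on the layer input is reproduced exactly by a rank-1 change to the weights.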

Title: Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback

Authors: Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16007
Pdf URL: https://arxiv.org/pdf/2507.16007
Copy Paste: [[2507.16007]] Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback(https://arxiv.org/abs/2507.16007)
Keywords: llm
Abstract: Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects -- providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.

Title: mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages

Authors: Hellina Hailu Nigatu, Min Li, Maartje ter Hoeve, Saloni Potdar, Sarah Chasins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16011
Pdf URL: https://arxiv.org/pdf/2507.16011
Copy Paste: [[2507.16011]] mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages(https://arxiv.org/abs/2507.16011)
Keywords: retrieval-augmented generation
Abstract: Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages Arabic and English for cross-lingual transfer. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.
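The reformulation of tail-entity prediction as question answering with BM25 retrieval can be sketched in a few lines. The English question template, the toy passages, and the function names `triple_to_question` and `bm25_rank` are hypothetical; the paper's actual prompts, languages, and retriever configuration may differ. The scoring follows the standard BM25 formula with the commonly used defaults k1 = 1.5 and b = 0.75.

```python
import math
from collections import Counter


def triple_to_question(head, relation):
    # Hypothetical English template; the paper's prompts may differ.
    return f"What is the {relation} of {head}?"


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Toy passages for illustration only.
docs = [
    "Addis Ababa is the capital of Ethiopia",
    "Asmara is the capital of Eritrea",
    "Injera is a sourdough flatbread",
]
question = triple_to_question("Ethiopia", "capital")
ranking = bm25_rank(question.rstrip("?"), docs)
```

The top-ranked passage would then be placed in the model's context alongside the question, and the model's answer is read off as the predicted tail entity.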

Title: AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering

Authors: Simon Baeuerle, Max Radyschevski, Ulrike Pado
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16054
Pdf URL: https://arxiv.org/pdf/2507.16054
Copy Paste: [[2507.16054]] AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering(https://arxiv.org/abs/2507.16054)
Keywords: language model, llm, chat
Abstract: In large organisations, knowledge is mainly shared in meetings, which takes up significant amounts of work time. Additionally, frequent in-person meetings produce inconsistent documentation -- official minutes, personal notes, presentations may or may not exist. Shared information therefore becomes hard to retrieve outside of the meeting, necessitating lengthy updates and high-frequency meeting schedules. Generative Artificial Intelligence (genAI) models like Large Language Models (LLMs) exhibit an impressive performance on spoken and written language processing. This motivates a practical usage of genAI for knowledge management in engineering departments: using genAI for transcribing meetings and integrating heterogeneous additional information sources into an easily usable format for ad-hoc searches. We implement an end-to-end pipeline to automate the entire meeting documentation workflow in a proof-of-concept state: meetings are recorded and minutes are created by genAI. These are further made easily searchable through a chatbot interface. The core of our work is to test this genAI-based software tooling in a real-world engineering department and collect extensive survey data on both ethical and technical aspects. Direct feedback from this real-world setup points out both opportunities and risks: a) users agree that the effort for meetings could be significantly reduced with the help of genAI models, b) technical aspects are largely solved already, c) organizational aspects are crucial for a successful ethical usage of such a system.

Title: Deep Researcher with Test-Time Diffusion

Authors: Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16075
Pdf URL: https://arxiv.org/pdf/2507.16075
Copy Paste: [[2507.16075]] Deep Researcher with Test-Time Diffusion(https://arxiv.org/abs/2507.16075)
Keywords: language model, llm, agent
Abstract: Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.

Title: The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

Authors: Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16076
Pdf URL: https://arxiv.org/pdf/2507.16076
Copy Paste: [[2507.16076]] The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models(https://arxiv.org/abs/2507.16076)
Keywords: language model, llm, prompt
Abstract: Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

Title: Efficient Compositional Multi-tasking for On-device Large Language Models

Authors: Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16083
Pdf URL: https://arxiv.org/pdf/2507.16083
Copy Paste: [[2507.16083]] Efficient Compositional Multi-tasking for On-device Large Language Models(https://arxiv.org/abs/2507.16083)
Keywords: language model, llm
Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

Title: Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task

Authors: Jared Moore, Ned Cooper, Rasmus Overmark, Beba Cibralic, Nick Haber, Cameron R. Jones
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16196
Pdf URL: https://arxiv.org/pdf/2507.16196
Copy Paste: [[2507.16196]] Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task(https://arxiv.org/abs/2507.16196)
Keywords: language model, llm, agent
Abstract: Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents' behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others' mental states. We present MindGames: a novel "planning theory of mind" (PToM) task which requires agents to infer an interlocutor's beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; p = 0.006). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people's preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone's preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.

Title: WakenLLM: A Fine-Grained Benchmark for Evaluating LLM Reasoning Potential and Reasoning Process Stability

Authors: Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Yao Wan, Kejia Huang, Zhichao Hou, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16199
Pdf URL: https://arxiv.org/pdf/2507.16199
Copy Paste: [[2507.16199]] WakenLLM: A Fine-Grained Benchmark for Evaluating LLM Reasoning Potential and Reasoning Process Stability(https://arxiv.org/abs/2507.16199)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) frequently output the label "Unknown", yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon Vague Perception. We therefore introduce a framework that quantifies the proportion of "Unknown" responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct ("Known") or intrinsically indeterminate outcomes. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. Having obtained a theoretical accuracy for reasoning tasks on different LLMs, we apply different methods to test whether the model can reach that accuracy given a baseline framework. Our work is meaningful in exploring the true reasoning ability of LLMs and providing a new perspective on solving the Vague Perception phenomenon.

Title: Towards Compute-Optimal Many-Shot In-Context Learning

Authors: Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16217
Pdf URL: https://arxiv.org/pdf/2507.16217
Copy Paste: [[2507.16217]] Towards Compute-Optimal Many-Shot In-Context Learning(https://arxiv.org/abs/2507.16217)
Keywords: language model, llm, prompt
Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
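The first selection strategy in the abstract, a few similarity-selected demonstrations combined with a larger cached random set, can be sketched as below. The cosine-similarity measure, the ordering that puts the cached set first so the shared prefix stays identical across test samples, and the names `build_prompt_demos`, `cached_ids`, and `n_similar` are assumptions for illustration rather than details confirmed by the abstract.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / ((dot(u, u) ** 0.5) * (dot(v, v) ** 0.5))


def build_prompt_demos(test_vec, demo_vecs, cached_ids, n_similar=2):
    """Return demo indices: the shared cached set first, then the
    `n_similar` non-cached demos most similar to this test sample."""
    cached = set(cached_ids)
    candidates = [i for i in range(len(demo_vecs)) if i not in cached]
    candidates.sort(key=lambda i: cosine(test_vec, demo_vecs[i]), reverse=True)
    return list(cached_ids) + candidates[:n_similar]


# Toy 2-D embeddings; indices 0-2 stand in for a randomly drawn cached set.
demo_vecs = [
    [0.5, 0.5], [0.2, 0.8], [0.7, 0.3],   # cached, shared by all test samples
    [1.0, 0.0], [0.0, 1.0], [0.9, 0.1],   # pool for per-sample selection
]
cached_ids = [0, 1, 2]
selected = build_prompt_demos([1.0, 0.0], demo_vecs, cached_ids)
```

In the second strategy described in the abstract, the random cached set would be replaced by demonstrations chosen with k-means centroids of the test-sample representations; only the way `cached_ids` is produced would change in this sketch.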

Title: FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents

Authors: Run Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, Zhengwen Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16248
Pdf URL: https://arxiv.org/pdf/2507.16248
Copy Paste: [[2507.16248]] FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents(https://arxiv.org/abs/2507.16248)
Keywords: agent
Abstract: AI agents are rapidly evolving in intelligence and are widely used in professional research applications such as STEM, software development, and finance. Among these AI agents, the deep research agent is a key category, as it can perform long-horizon tasks and solve problems of greater complexity. However, there are few evaluation frameworks and benchmarks that systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill the gap, we propose FinResearchBench, a logic-tree-based Agent-as-a-Judge framework that specifically targets financial research agents. It provides a comprehensive and automatic assessment of research agents across 7 key types of tasks in the financial research domain. The contributions of this work are twofold: (1) the first Agent-as-a-Judge system, an innovative design that extracts the logic tree of the research outcome and uses it as intermediate information to present a comprehensive, reliable, and robust evaluation; (2) a finance-oriented benchmark covering 70 typical financial research questions, spread across 7 frequently encountered types of tasks in the domain.

Title: Efficient RL for optimizing conversation level outcomes with an LLM-based tutor

Authors: Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16252
Pdf URL: https://arxiv.org/pdf/2507.16252
Copy Paste: [[2507.16252]] Efficient RL for optimizing conversation level outcomes with an LLM-based tutor(https://arxiv.org/abs/2507.16252)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.

Title: iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss

Authors: Yujian Sun, Tian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16263
Pdf URL: https://arxiv.org/pdf/2507.16263
Copy Paste: [[2507.16263]] iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss(https://arxiv.org/abs/2507.16263)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) gain widespread adoption, increasing attention has been given to the challenge of making LLMs forget non-compliant data memorized during pre-training. Machine Unlearning focuses on efficiently erasing sensitive information from LLMs under limited computational resources. To advance research in this area, SemEval 2025 Task 4: "Unlearning Sensitive Content from Large Language Models" introduces three unlearning datasets and establishes a benchmark by evaluating both forgetting effectiveness and the preservation of standard capabilities. In this work, we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.
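Summary: As large language models (LLMs) gain widespread adoption, growing attention is being paid to the challenge of making an LLM forget non-compliant data memorized during pre-training. Machine unlearning focuses on efficiently erasing sensitive information from LLMs under limited computational resources. To advance research in this area, SemEval 2025 Task 4, "Unlearning Sensitive Content from Large Language Models", introduces three unlearning datasets and establishes a benchmark that evaluates both forgetting effectiveness and the preservation of standard capabilities. In this work we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.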

Title: Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction

Authors: Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16271
Pdf URL: https://arxiv.org/pdf/2507.16271
Copy Paste: [[2507.16271]] Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction(https://arxiv.org/abs/2507.16271)
Keywords: language model, llm
Abstract: With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at this https URL.
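Summary: With the emergence of large language models (LLMs), LLMs are expected to extract explicit information effectively from complex real-world documents (e.g., papers and reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths, designed to systematically evaluate how well LLMs comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE comprises 11 carefully crafted tasks across three diverse domains and requires models to generate context-specific schemas tailored to varied input queries. In the experiments, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at this https URL.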

Title: Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

Authors: Paul-Andrei Pogăcean, Sanda-Maria Avram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16284
Pdf URL: https://arxiv.org/pdf/2507.16284
Copy Paste: [[2507.16284]] Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis(https://arxiv.org/abs/2507.16284)
Keywords: language model
Abstract: The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, the non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determinism by leveraging monogram and bigram frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts and older writings. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
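Summary: The debate around language identification has drawn renewed attention in recent years, especially with the rapid evolution of AI-powered language models, while non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determinism that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy on longer texts and older writings. These results show that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.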
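
Illustrative sketch (not the authors' implementation): the frequency-based idea in this abstract can be approximated by comparing a text's character-bigram frequency profile with per-language reference profiles under a Minkowski distance of order p. The reference sentences, the function names, and the use of raw relative frequencies instead of the paper's frequency rankings are all assumptions made for this example.

```python
from collections import Counter


def bigram_profile(text):
    """Relative frequency of each character bigram (letters and spaces only)."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    bigrams = [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]
    total = len(bigrams)
    if total == 0:
        return {}
    return {bg: n / total for bg, n in Counter(bigrams).items()}


def minkowski_distance(profile_a, profile_b, p=2.0):
    """Minkowski distance of order p between two sparse frequency profiles."""
    keys = set(profile_a) | set(profile_b)
    total = sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) ** p for k in keys)
    return total ** (1.0 / p)


def detect_language(text, references, p=2.0):
    """Return the reference language whose profile is closest to the text."""
    profile = bigram_profile(text)
    return min(references, key=lambda lang: minkowski_distance(profile, references[lang], p))


# Toy reference texts (hypothetical); a real system would use large corpora.
REFERENCE_TEXTS = {
    "english": "the quick brown fox jumps over the lazy dog and then the other dog runs there with them",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann läuft der andere hund dorthin",
}
REFERENCES = {lang: bigram_profile(text) for lang, text in REFERENCE_TEXTS.items()}
```

With p=1 this is the Manhattan distance and with p=2 the Euclidean distance; the order p would be a tunable choice in any such scheme.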

Title: SpeLLM: Character-Level Multi-Head Decoding

Authors: Amit Ben-Artzy, Roy Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16323
Pdf URL: https://arxiv.org/pdf/2507.16323
Copy Paste: [[2507.16323]] SpeLLM: Character-Level Multi-Head Decoding(https://arxiv.org/abs/2507.16323)
Keywords: llm
Abstract: Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention's quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the $k$ linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach provides a potential avenue for reducing LLM costs, while increasing support for underrepresented languages and domains.
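Summary: Scaling the LLM vocabulary is often used to reduce input sequence length and alleviate the quadratic cost of attention. Current LLM architectures, however, impose a critical bottleneck on this procedure: the output projection layer scales linearly with vocabulary size, which makes substantial expansion impractical. We propose SpeLLM, a method that decouples the input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the $k$ linear heads simultaneously predicts a single character, so the model can represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM into a SpeLLM. Our experiments with four pre-trained LLMs show that their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach offers a potential avenue for reducing LLM costs while increasing support for underrepresented languages and domains.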
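
Illustrative sketch (not the authors' code): the multi-head character decoding described in this abstract can be pictured as k small linear heads that each pick one character from the same hidden state. The alphabet, the padding symbol, the random weights, and the greedy argmax decoding below are assumptions chosen for the example.

```python
import random

# Hypothetical character set; "<pad>" marks unused positions in short tokens.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["<pad>"]


def make_heads(k, hidden_size, seed=0):
    """Create k independent linear heads, each of shape |ALPHABET| x hidden_size."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(hidden_size)] for _ in ALPHABET]
        for _ in range(k)
    ]


def decode_token(hidden, heads):
    """Each head scores every character from the same hidden state; take the argmax."""
    chars = []
    for head in heads:
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
        best = max(range(len(logits)), key=logits.__getitem__)
        chars.append(ALPHABET[best])
    return "".join(c for c in chars if c != "<pad>")


def output_projection_params(k, hidden_size, alphabet_size):
    """Parameter count of k character heads (bias terms ignored)."""
    return k * alphabet_size * hidden_size


def head_output_combinations(k, alphabet_size):
    """Number of distinct joint outputs the k heads can produce."""
    return alphabet_size ** k
```

Under these assumptions, the head parameters grow linearly in k while the number of joint outputs grows exponentially, which is the trade-off the abstract points to.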

Title: Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Authors: Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16331
Pdf URL: https://arxiv.org/pdf/2507.16331
Copy Paste: [[2507.16331]] Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny(https://arxiv.org/abs/2507.16331)
Keywords: language model, llm, chain-of-thought
Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.
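Summary: Existing Large Language Models (LLMs) that are based on informal language (e.g., human language) and trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models can hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems, where generative models operate in formal language spaces (e.g., Dafny), enables automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is common practice to use human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, providing such priors to supervise complex programming tasks becomes unacceptably all-consuming. In this work we systematically explore ways to reduce human priors, with the formal language Dafny as the main environment for our pilot study. Our approach relies mainly on an automatic and scalable data curation pipeline and on careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.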

Title: GG-BBQ: German Gender Bias Benchmark for Question Answering

Authors: Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16410
Pdf URL: https://arxiv.org/pdf/2507.16410
Copy Paste: [[2507.16410]] GG-BBQ: German Gender Bias Benchmark for Question Answering(https://arxiv.org/abs/2507.16410)
Keywords: language model, llm
Abstract: Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.
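Summary: In Natural Language Processing (NLP), fairness evaluation is often associated with assessing bias and reducing the associated harm. Such evaluation is usually carried out with a benchmark dataset for a task such as Question Answering, created to measure bias in a model's predictions along various dimensions, including gender identity. In our work we evaluate gender bias in German Large Language Models (LLMs), using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German, and the errors in the machine-translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation, because of the limitations of machine translation from English into a language with grammatical gender such as German. Our final dataset consists of two subsets: Subset-I, which contains group terms related to gender identity, and Subset-II, in which group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.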

Title: PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning

Authors: Hui Xiang, Jinqiao Shi, Ting Zhang, Xiaojie Zhao, Yong Liu, Yong Ma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16424
Pdf URL: https://arxiv.org/pdf/2507.16424
Copy Paste: [[2507.16424]] PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning(https://arxiv.org/abs/2507.16424)
Keywords: prompt
Abstract: Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model's predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.
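Summary: Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. AL methods typically rely on the empirical distribution of labeled data to define the decision boundary and to estimate uncertainty or diversity, and then identify potentially high-quality samples. In few-shot scenarios the empirical distribution often diverges significantly from the target distribution, which shifts the decision boundary away from its optimal position. Existing methods nevertheless overlook the role of unlabeled samples in bringing the empirical distribution closer to the target distribution, which leads to a suboptimal decision boundary and to the selection of samples that represent the target distribution inadequately. To address this, we propose a hybrid AL framework called PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). The framework accounts for the contribution of each unlabeled data point to aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first uses unlabeled data to construct sample-aware dynamic soft prompts that adjust the model's predictive distribution and decision boundary. Then, based on the adjusted decision boundary, it combines uncertainty estimation with both global and local diversity to select high-quality samples that represent the target distribution more accurately. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL outperforms nine baselines. Our codebase is openly accessible.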

Title: Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch

Authors: Elza Strazda, Gerasimos Spanakis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16442
Pdf URL: https://arxiv.org/pdf/2507.16442
Copy Paste: [[2507.16442]] Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch(https://arxiv.org/abs/2507.16442)
Keywords: language model
Abstract: Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting. Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. Recently, considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on English. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other concerns advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models, BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B, exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models exhibit the least. Additionally, results indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.
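Summary: Warning: this paper contains explicit statements of offensive stereotypes that may be upsetting. Language models are prone to exhibiting biases, which further amplifies unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, safe and fair language models need to be ensured. Considerable attention has recently been paid to measuring bias in language models, yet most studies have focused only on English. The paper introduces a Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models. The resulting dataset consists of 1463 sentence pairs covering bias in 9 categories, such as sexual orientation, gender, and disability. Each pair is made up of contrasting sentences, one concerning a disadvantaged group and the other an advantaged group. Using the Dutch CrowS-Pairs dataset, we show that the language models BERTje, RobBERT, multilingual BERT, GEITje, and Mistral-7B exhibit substantial bias across the bias categories. Using the English and French versions of CrowS-Pairs, bias was also evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models; English models showed the most bias and Dutch models the least. The results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight how bias varies across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.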

Title: Towards Enforcing Company Policy Adherence in Agentic Workflows

Authors: Naama Zwerdling, David Boaz, Ella Rabinovich, Guy Uziel, David Amid, Ateret Anaby-Tavor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16459
Pdf URL: https://arxiv.org/pdf/2507.16459
Copy Paste: [[2507.16459]] Towards Enforcing Company Policy Adherence in Agentic Workflows(https://arxiv.org/abs/2507.16459)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging $\tau$-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.
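Summary: Large Language Model (LLM) agents hold promise as a flexible and scalable alternative to traditional business process automation, but they struggle to follow complex company policies reliably. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration in which these guards ensure compliance before each agent action. We demonstrate our approach on the challenging $\tau$-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.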
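
Illustrative sketch (not the authors' framework): the two-phase idea in this abstract can be pictured as guard functions registered per tool ahead of time and executed before every tool call at runtime. The tool, the 24-hour cancellation rule, and all names below are hypothetical and stand in for guard code that the paper compiles from policy documents.

```python
class PolicyViolation(Exception):
    """Raised when a guard rejects a tool call."""


# "Buildtime" output: guards grouped by the tool they protect.
GUARDS = {}


def guard(tool_name):
    """Decorator that associates a guard function with a tool."""
    def register(check):
        GUARDS.setdefault(tool_name, []).append(check)
        return check
    return register


@guard("cancel_reservation")
def cancellation_window(args):
    # Hypothetical policy clause, used only for illustration.
    if args["hours_since_booking"] > 24:
        raise PolicyViolation("cancellation allowed only within 24 hours of booking")


def cancel_reservation(args):
    """Hypothetical tool the agent can call."""
    return f"reservation {args['reservation_id']} cancelled"


TOOLS = {"cancel_reservation": cancel_reservation}


def guarded_call(tool_name, args):
    """Runtime integration: run every guard before executing the tool."""
    for check in GUARDS.get(tool_name, []):
        check(args)
    return TOOLS[tool_name](args)
```

Because the guards are plain deterministic code, a rejected call fails the same way every time, independent of what the agent's LLM generates.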

Title: ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs

Authors: Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16488
Pdf URL: https://arxiv.org/pdf/2507.16488
Copy Paste: [[2507.16488]] ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs(https://arxiv.org/abs/2507.16488)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states' update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
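Summary: Large language models (LLMs) excel at a wide range of natural language processing tasks, but their tendency to hallucinate undermines their reliability. Existing hallucination detection methods that leverage hidden states focus mainly on static, isolated representations and overlook how those states evolve across layers, which limits their efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the update of the hidden states. We validate empirically that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. In addition, ablation studies and case analyses give deeper insight into the underlying mechanism of the method and improve its interpretability.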

Title: Combining Language and Topic Models for Hierarchical Text Classification

Authors: Jaco du Toit, Marcel Dunaiski
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16490
Pdf URL: https://arxiv.org/pdf/2507.16490
Copy Paste: [[2507.16490]] Combining Language and Topic Models for Hierarchical Text Classification(https://arxiv.org/abs/2507.16490)
Keywords: language model
Abstract: Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use an HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism, which obtains label-specific document representations by weighing the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.
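Summary: Hierarchical text classification (HTC) is a natural language processing task whose objective is to categorise text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to combine the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) in order to improve classification performance. In addition, using topic models together with PLMs to extract features from text documents has been shown to be effective for multi-label text classification tasks. The rationale for combining these feature extractors is that the PLM captures finer-grained contextual and semantic information, while the topic model obtains high-level representations that consider the corpus of documents as a whole. In this paper we use an HTC approach in which a PLM and a topic model extract features from text documents, and those features are used to train a classification model. Our objective is to determine whether combining the features extracted by the two models benefits HTC performance in general. In our approach, the extracted features pass through separate convolutional layers whose outputs are combined and fed to a label-wise attention mechanism, which obtains label-specific document representations by weighing the most important features for each class separately. We run comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared with using only the features obtained by the PLM. In contrast to previous work, this shows that incorporating topic model features into text classification tasks should not be assumed to be beneficial.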
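
Illustrative sketch (not the authors' model): the label-wise attention step mentioned in this abstract can be pictured as one attention distribution per label over the same sequence of feature vectors, yielding one document representation per label. The dot-product scoring, the learned-query formulation, and all names below are assumptions; the convolutional layers and the PLM and topic-model feature extractors are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def label_wise_attention(features, label_queries):
    """Return one attention-weighted document vector per label.

    features: list of T feature vectors (each of dimension d)
    label_queries: list of L query vectors (each of dimension d), one per label
    """
    dim = len(features[0])
    documents = []
    for query in label_queries:
        # Each label attends over the same features with its own weights.
        weights = softmax([dot(query, f) for f in features])
        documents.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
    return documents
```

Each label-specific vector would then feed that label's classifier, so different classes can focus on different parts of the document.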

Title: Learning Text Styles: A Study on Transfer, Attribution, and Verification

Authors: Zhiqiang Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16530
Pdf URL: https://arxiv.org/pdf/2507.16530
Copy Paste: [[2507.16530]] Learning Text Styles: A Study on Transfer, Attribution, and Verification(https://arxiv.org/abs/2507.16530)
Keywords: language model, llm
Abstract: This thesis advances the computational understanding and manipulation of text styles through three interconnected pillars: (1) Text Style Transfer (TST), which alters stylistic properties (e.g., sentiment, formality) while preserving content; (2) Authorship Attribution (AA), identifying the author of a text via stylistic fingerprints; and (3) Authorship Verification (AV), determining whether two texts share the same authorship. We address critical challenges in these areas by leveraging parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.
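Summary: This thesis advances the computational understanding and manipulation of text styles through three interconnected pillars: (1) Text Style Transfer (TST), which alters stylistic properties (e.g., sentiment, formality) while preserving content; (2) Authorship Attribution (AA), which identifies the author of a text via stylistic fingerprints; and (3) Authorship Verification (AV), which determines whether two texts share the same authorship. We address critical challenges in these areas through parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.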

Title: Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language

Authors: Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schlüter
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16557
Pdf URL: https://arxiv.org/pdf/2507.16557
Copy Paste: [[2507.16557]] Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language(https://arxiv.org/abs/2507.16557)
Keywords: language model, llm
Abstract: In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.
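Summary: In recent years, various methods have been proposed for evaluating gender bias in large language models (LLMs). A key challenge is the transferability of bias measurement methods that were originally developed for English when they are applied to other languages. This work contributes to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLMs, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work adds to the understanding of gender bias in LLMs across languages and underscores the need for tailored evaluation frameworks.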

Title: Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models

Authors: Mohamad Ballout, Serwan Jassim, Elia Bruni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16572
Pdf URL: https://arxiv.org/pdf/2507.16572
Copy Paste: [[2507.16572]] Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models(https://arxiv.org/abs/2507.16572)
Keywords: language model, llm
Abstract: This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLM development.
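Summary: This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks, using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, and LLaVA-OneVision and the proprietary Gemini 2.0 Flash Thinking, and find that even the latest models struggle to distinguish physically plausible scenarios from implausible ones reliably. To go beyond performance metrics, we run a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but the language model does not use this information effectively, which leads to reasoning failures. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement and offer insights for future MLLM development.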

Title: Step-Audio 2 Technical Report

Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.16632
Pdf URL: https://arxiv.org/pdf/2507.16632
Copy Paste: [[2507.16632]] Step-Audio 2 Technical Report(https://arxiv.org/abs/2507.16632)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit this https URL for more information.
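Summary: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder with reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To support genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, which significantly improves its responsiveness to paralinguistic information such as speaking styles and emotions. To make effective use of the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and can call external tools, such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results show that Step-Audio 2 achieves state-of-the-art performance on a variety of audio understanding and conversational benchmarks compared with other open-source and commercial solutions. Please visit this https URL for more information.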

Title: Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

Authors: Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16642
Pdf URL: https://arxiv.org/pdf/2507.16642
Copy Paste: [[2507.16642]] Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models(https://arxiv.org/abs/2507.16642)
Keywords: language model, gpt, llm
Abstract: The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
摘要：财务文件的审核是历史上是劳动密集型过程，它是转型的悬崖。 AI驱动的解决方案通过建议从财务报告中相关的文本段落来符合会计标准的法律要求，从而促进了简化这一过程。但是，仍然存在一个明显的限制：这些系统通常在验证建议的摘录是否确实符合特定的法律授权时差不多。因此，在本文中，我们探讨了在不同模型配置的监管合规性领域中公开可用的大语言模型（LLM）的效率。我们特别强调比较诸如Llama-2之类的尖端开源LLM与Openai的GPT型号等专有对应物。这种比较分析利用了我们合作伙伴PriceWaterhouseCoopers（PWC）德国提供的两个自定义数据集。我们发现，开源骆驼270亿款模型在检测不合规或真正的负面事件方面表现出杰出的表现，击败了所有专有的同行。然而，诸如GPT-4之类的专有模型在各种方面，尤其是在非英语环境中的最佳状态。

Title: P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs

Authors: Dongjun Jang, Youngchae Ahn, Hyopil Shin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16656
Pdf URL: https://arxiv.org/pdf/2507.16656
Copy Paste: [[2507.16656]] P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs(https://arxiv.org/abs/2507.16656)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.
摘要：这项研究探讨了基于文本的大语言模型（LLM）中语音推理的潜力。利用语音基准基准，我们评估了诸如押韵单词生成，G2P转换和音节计数之类的任务。我们在12个LLM中的评估表明，尽管很少有学习的学习能力不一致，但引入了一种新颖的教学动机参与式思想链（P-COT）提示，该提示（P-COT）提示是基于脚手架和发现学习等教育理论，始终如一地增强了表现。该方法利用结构化的指导来激活潜在的语音能力，在某些任务中达到高达52％的改善，甚至超过人类基础。未来的工作可能旨在优化针对特定模型的P-COT提示，或者在不同语言领域探索其应用程序。

Title: Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs

Authors: Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Lin, Yingya Zhang, Shiwei Zhang, Difan Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16663
Pdf URL: https://arxiv.org/pdf/2507.16663
Copy Paste: [[2507.16663]] Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs(https://arxiv.org/abs/2507.16663)
Keywords: llm, prompt
Abstract: Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model's own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
摘要：尽管努力在单个模型中统一多模式的生成和理解任务，但我们显示了这些MLLM表现出自相矛盾，其中一代人会根据模型的理解产生被视为未经输入提示的图像。我们定义了量化这种自相矛盾的未统一分数。我们的经验结果表明，自相矛盾的主要是由于薄弱的一代而不是与提示不符而不是误解的薄弱一代。这种能力不对称表明，利用自相矛盾的自我完善的潜力，在这种能力上，更强大的模型理解指导较弱的产生来减轻产生的理解差距。应用标准的训练后方法（例如SFT，DPO）成功地改善了发电和统一。我们在仅微调生成分支时发现了对产生和理解的共同改进效应，这是一种在训练前已知的现象，但在训练后却没有被忽视。我们的分析表明，改进源于对以前错误地识别为及时一致的假阳性的更好检测。从理论上讲，我们显示了生成和理解之间的一致性训练动力学，使迅速降低的世代还可以改善理解分支中的不匹配检测。此外，该框架揭示了在不良监督下共同降低的潜在风险 - 在我们的实验中经验验证的现象被忽略的现象。值得注意的是，我们发现诸如未统一得分之类的固有指标无法将共同降解与共同改进区分开，这突出了数据质量检查的必要性。最后，我们根据我们的发现提出了一种基于课程的策略，随着模型的改进，逐渐引入了更难的样本，从而可以更好地统一并改善MLLM的生成和理解。

Title: PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Authors: Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.16679
Pdf URL: https://arxiv.org/pdf/2507.16679
Copy Paste: [[2507.16679]] PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization(https://arxiv.org/abs/2507.16679)
Keywords: language model, llm, prompt
Abstract: In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
摘要：内在的学习表现出很大的潜力，可以使大语模型（LLM）与人类价值观保持一致，有助于减少有害产量并适应多样化的偏好，而无需昂贵的训练后，被称为内部上下文对齐（ICA）。但是，LLMS对输入提示的理解仍然不可知，限制了ICA解决价值紧张的能力 - 人类价值观本质上是多元化的，通常构成冲突的要求，例如刺激与传统。因此，当前的ICA方法面临着指令瓶颈挑战，在该指示中，LLM努力在单个提示中调和多个预期值，从而导致不完整或有偏见。为了解决这个问题，我们提出了一种新颖的多元化ICA方法Picaco。没有微调，Picaco优化了一个元指导，该元指导可导航多个值，以更好地引起LLMS对它们的理解并改善其对齐方式。这是通过最大化指定值和LLM响应之间的总相关性来实现的，理论上增强了值的相关性，同时降低了分散注意力的噪声，从而产生了有效的值指令。对五个值集的广泛实验表明，Picaco与Black-Box和开源LLM均表现良好，表现优于最近的几个强大基线，并且在多达8个不同的值中取得了更好的平衡。
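
For reference, the quantity named in the PICACO abstract can be written out using the standard information-theoretic definition of total correlation. The abstract does not state the exact objective that PICACO optimizes, so reading the variables $X_1,\dots,X_n$ as the specified values together with the LLM response is an assumption made here for illustration only.

```latex
\begin{align}
\mathrm{TC}(X_1,\dots,X_n)
  &= \sum_{i=1}^{n} H(X_i) \;-\; H(X_1,\dots,X_n) \\
  &= D_{\mathrm{KL}}\!\left( p(x_1,\dots,x_n) \,\middle\|\, \prod_{i=1}^{n} p(x_i) \right)
\end{align}
```

For $n = 2$ this reduces to the mutual information $I(X_1; X_2)$, so maximizing it rewards responses that are statistically dependent on the values they are meant to express.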

Title: Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance

Authors: Lars Hillebrand, Armin Berger, Daniel Uedelhoven, David Berghaus, Ulrich Warning, Tim Dilmaghani, Bernd Kliem, Thomas Schmid, Rüdiger Loitz, Rafet Sifa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16711
Pdf URL: https://arxiv.org/pdf/2507.16711
Copy Paste: [[2507.16711]] Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance(https://arxiv.org/abs/2507.16711)
Keywords: language model, llm, chat, retrieval augmented generation
Abstract: Risk and Quality (R&Q) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance R&Q query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.
摘要：高度监管的行业中的风险和质量（R＆Q）保证需要持续的复杂监管框架进行导航，员工处理大量日常查询，要求准确的政策解释。依靠专业专家的传统方法创造了操作瓶颈并限制可扩展性。我们提出了一种利用大型语言模型（LLM），混合搜索和相关性提升以增强R＆Q查询处理的新型增强生成（RAG）系统。对124个专家注销的现实查询进行了评估，我们的积极部署系统表明，对传统的抹布方法进行了实质性改进。此外，我们进行了广泛的超参数分析，以比较和评估多个配置设置，从而为从业者提供了宝贵的见解。

Title: RAVine: Reality-Aligned Evaluation for Agentic Search

Authors: Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.16725
Pdf URL: https://arxiv.org/pdf/2507.16725
Copy Paste: [[2507.16725]] RAVine: Reality-Aligned Evaluation for Agentic Search(https://arxiv.org/abs/2507.16725)
Keywords: llm, agent
Abstract: Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines a model's interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at this https URL.
摘要：代理搜索是一种更自主和自适应的检索扩展范式，它正在推动智能搜索系统的发展。但是，现有的评估框架无法很好地与代理搜索的目标保持一致。首先，当前基准中常用的复杂查询通常偏离现实的用户搜索方案。其次，先前的方法在提取地面真相进行端到端评估时倾向于引入噪音，从而导致细分水平的扭曲评估。第三，大多数当前框架仅着眼于最终答案的质量，忽略了对代理搜索固有的迭代过程的评估。为了解决这些局限性，我们提出了Ravine，这是一个通过搜索的代理LLM的现实一致的评估框架。 Ravine的目标是多点查询和长格式答案，可以更好地反映用户的意图，并引入了可归因的地面真理构建策略，以提高细粒度评估的准确性。此外，Ravine在整个迭代过程中研究了模型与搜索工具的互动，并说明了效率的因素。我们使用Ravine进行了一系列模型，并获得了多种见解，我们希望这将有助于推进代理搜索系统的开发。代码和数据集可在此HTTPS URL上找到。

Title: Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16784
Pdf URL: https://arxiv.org/pdf/2507.16784
Copy Paste: [[2507.16784]] Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning(https://arxiv.org/abs/2507.16784)
Keywords: language model, llm
Abstract: To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
摘要：为了打破瓶颈推理的准确性和效率的大型语言模型（LLMS）的上下文限制，我们提出了线程推理模型（TIM），该家族是一个培训了递归和分解性问题解决问题的LLM家族，以及Timrun，一种推理跑步时间，启用了长途锻炼的结构性推理超出上下文的限制。 Tim在Timrun上托管的Tim在单语言模型推理中几乎支持无限的工作记忆和多跳工具调用，克服输出限制，位置限制限制和GPU-MEMORY BOTTLENECKS。通过将自然语言建模为通过长度和深度而不是线性序列测量的推理树来实现性能。 The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al, 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning.实验结果表明，即使在GPU内存中最多可操纵KV高速缓存的90％时，我们的系统仍具有较高的推理吞吐量。它还在数学任务上提供了准确的推理，并处理需要长马推理和多跳工具使用的信息检索挑战。
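
The reasoning-tree idea in the TIM abstract (tasks with thoughts, recursive subtasks, and conclusions, plus rule-based subtask pruning) can be illustrated with a small sketch. This is not the authors' implementation: the `Task` class, the `working_memory` function, and the specific pruning rule (a concluded task keeps only its thought and conclusion) are assumptions chosen to make the idea concrete, and the sketch operates on text segments rather than on key-value cache states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """One node of a reasoning tree: a thought, optional subtasks, a conclusion."""
    thought: str
    subtasks: List["Task"] = field(default_factory=list)
    conclusion: Optional[str] = None  # None while the task is still open


def full_trace(task: Task) -> List[str]:
    """Linearize the whole tree, as an unpruned context would store it."""
    segments = [task.thought]
    for sub in task.subtasks:
        segments.extend(full_trace(sub))
    if task.conclusion is not None:
        segments.append(task.conclusion)
    return segments


def working_memory(task: Task) -> List[str]:
    """Segments kept under the assumed (hypothetical) pruning rule.

    A concluded task keeps only its thought and conclusion, and its
    subtasks are dropped. An open task keeps its thought and recurses
    into its subtasks.
    """
    kept = [task.thought]
    if task.conclusion is not None:
        kept.append(task.conclusion)
        return kept
    for sub in task.subtasks:
        kept.extend(working_memory(sub))
    return kept


# Toy example: the first subtask is finished, the second is still open.
root = Task(
    thought="Plan: compute (2 + 3) * 4",
    subtasks=[
        Task(
            thought="Add 2 and 3",
            subtasks=[Task(thought="Recall 2 + 3", conclusion="5")],
            conclusion="2 + 3 = 5",
        ),
        Task(thought="Multiply 5 by 4"),
    ],
)

print(len(full_trace(root)), "segments unpruned")
print(len(working_memory(root)), "segments after pruning")
```

In this toy tree the nested "Recall 2 + 3" subtask disappears from working memory once its parent has a conclusion, which is the kind of saving that lets a runtime reuse positions and memory pages for later reasoning steps.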

Title: Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent

Authors: Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16799
Pdf URL: https://arxiv.org/pdf/2507.16799
Copy Paste: [[2507.16799]] Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent(https://arxiv.org/abs/2507.16799)
Keywords: language model, llm, prompt, agent
Abstract: The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework through test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character's features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance and also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves outstanding performance in generating expressive and stylistically consistent character dialogues.
摘要：大型语言模型（LLM）的快速发展使角色扮演语言代理在各种应用中都具有巨大的潜力。但是，仅依靠提示和上下文输入通常不足以使沉浸在特定角色，尤其是众所周知的虚构或公众人物中。另一方面，由于与数据收集和培训所需的计算资源相关的挑战，基于微调的方法面临局限性，从而限制了其更广泛的适用性。为了解决这些问题，我们提出了测试时间匹配（TTM），这是一个通过测试时间缩放和上下文工程的无培训角色扮演框架。 TTM使用LLM代理自动将角色的特征分解为个性，记忆和语言风格。我们的框架涉及一个结构化的三阶段生成管道，该管道利用这些功能来控制角色扮演。它达到了高保真的角色扮演表现，还可以使各种语言风格甚至人格和记忆的变化跨越无缝组合。我们通过人类评估评估我们的框架，结果表明，我们的方法在产生表现力和风格一致的角色对话方面取得了出色的性能。

Title: Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Authors: Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wang Wei, Peng Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16802
Pdf URL: https://arxiv.org/pdf/2507.16802
Copy Paste: [[2507.16802]] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning(https://arxiv.org/abs/2507.16802)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) demonstrate tremendous potential in the financial domain, yet existing models often fall short in scenarios demanding robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task taxonomy with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, two-stage learning processes, and detailed attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including FinEva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications.
摘要：大型语言模型（LLMS）在金融领域表现出巨大的潜力，但是现有模型通常在要求强大的推理能力，严格的可信赖性要求以及有效适应特定于任务需求的情况下缺乏。我们介绍了基于QWEN3基础模型的专门设计的Agent-Fin-R1系列金融语言模型（8B和32B参数），以增强对财务应用程序的推理能力，可靠性和域专业化。我们的优化方法将高质量的系统财务任务分类法与全面的多层可信度保证框架相结合。该框架包括高质量的值得信赖的知识工程，多代理可信赖的数据综合和严格的数据验证治理。通过标签引导的自动化困难意识优化，拖车阶段学习过程和详细的归因系统，我们在培训效率方面实现了实质性提高。我们的模型对包括Fineva，FineVal和Financeiq在内的主流财务基准以及一般推理数据集（如Math-500和GPQA）进行了全面评估。为了彻底评估现实世界的部署能力，我们创新提出了Finova评估基准，该基准的重点是代理级财务推理和合规性验证。实验结果表明，Agent-Fin-R1不仅在财务任务上实现最先进的绩效，而且还具有出色的一般推理能力，从而证实了其作为高风险财务应用的可信赖解决方案的有效性。

Title: LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Authors: Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Shu-Kai Hsieh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16809
Pdf URL: https://arxiv.org/pdf/2507.16809
Copy Paste: [[2507.16809]] LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs(https://arxiv.org/abs/2507.16809)
Keywords: language model, llm, agent
Abstract: We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.
摘要：我们提出了Lingbench ++，这是一种语言知名的基准和推理框架，旨在评估受国际语言学奥林匹克（IOL）启发的复杂语言任务的大型语言模型（LLM）。与仅关注最终答案准确性的先前基准分析不同，Lingbench ++提供结构化的推理轨迹，逐步评估协议以及丰富的类型元数据，遍及90多种低资源和跨文化语言。我们进一步开发了整合语法知识检索，具有工具增强推理和故意假设检验的多机构体系结构。通过对基线的系统比较和我们提出的代理模型，我们证明了配备了外部知识源和迭代推理的模型在精度和可解释性方面都超过了单通道方法。 Lingbench ++为LLM中的语言基础，文化知识和认知合理的推理提供了全面的基础。